I have a method which takes an OrderedSet X of objects of type A, and an OrderedSet Y of OrderedSets of objects of type B. (Nested)
This method then returns a new OrderedSet Z of Edges based on the two sets given.
Basically, I give the method two sets, and the method gives me back a connection much like the mathematical definition of a function.
So if I want to make a bijective connection, I would have to ensure that both sets have equal size, and I do not want null objects to be present anywhere.
The problem is, these sets are going to be arbitrarily large, and what I would like to do is this:
Ensure that nothing in these sets will ever be null
Ensure that every set has the required size
What I have done to obtain what I want:
I implemented the OrderedSet as a HashSet with extra properties, and I simply check for null elements using Contains, which is O(1) (I am not entirely sure this is a good solution)
OrderedSet refuses to add null objects, but this does not prevent the elements within the set from being mutated after they are added, and possibly set to null that way (a simplified sketch of this wrapper follows this list)
I tried taking the easy way out and nested the contents of the method in a try-catch, so if something goes wrong I simply catch the error and move on, rather than having to validate all the data passed first (assume the sets might be huge; there is no limit on their size). The issue with this is that it might fail at the very end, wasting computation time
I also tried a brute-force check: checking every set for nulls, including the sets within the second parameter, and also checking every single set for the expected size. This would work in theory, but it feels impractical, and surely there is a more clever solution to this problem.
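Roughly what the OrderedSet wrapper described above looks like (a simplified sketch with assumed names, not the actual implementation):

using System;
using System.Collections;
using System.Collections.Generic;

public class OrderedSet<T> : IEnumerable<T>
{
    private readonly HashSet<T> _items = new HashSet<T>();
    private readonly List<T> _order = new List<T>();

    public int Count => _items.Count;

    public bool Add(T item)
    {
        if (item == null)
            throw new ArgumentNullException(nameof(item)); // nulls are rejected on insertion
        if (!_items.Add(item))
            return false;                                  // already present
        _order.Add(item);
        return true;
    }

    // O(1); also used to ask whether a null somehow ended up in the set.
    public bool Contains(T item) => _items.Contains(item);

    public IEnumerator<T> GetEnumerator() => _order.GetEnumerator();
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}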
What I have considered:
I have looked into Contracts, but this article shows a significant (in my opinion) decrease in performance, although they do accomplish what I am looking for (yet still in a hacky sort of way)
I have read about the non-nullable reference types that are supposed to come out in C# 8.0, which might solve the problem of having to check for null elements, but I would still be left with the issue of having to check the sizes of all the sets involved each time I want to make a new connection.
My goal:
To have a readable, yet efficient solution to parameter validation.
Thank you for your time, please feel free to correct me if I have said something that is not quite correct.
I have a Dictionary (of Long, Class), where Class has multiple properties (assume we have a property called Updated as Boolean).
I want to update this Updated property to True at once for, let's say, all odd-keyed records (or based on any other specific rule). What is the best way to do so?
My thought is to use LINQ to fetch those records and then For Each over them, but is there a better way, like doing a mass update where a condition matches (like what we do in a database)?
An example of my approach is below. Appreciate it if there is a better way to do such an update...
Thanks
Dim ReturnedObjs = From Obj In Dictionary Where Obj.Key Mod 2 = 1
For Each item As KeyValuePair(Of Long, Class) In ReturnedObjs
    item.Value.Updated = True
Next
First, this sounds like an obvious case for the speed rant:
https://ericlippert.com/2012/12/17/performance-rant/
Second:
The best way is to keep this in the database. You are not going to beat the speed of a DB query with indexes designed for quick matching by transferring the data over the network twice (once to get it, once to return it) and doubling the search load (once to get all the odd ones, once to update all the ones you just changed). My standing advice is to always keep as much work as possible on the DB side. Your client code will never be able to beat it.
Third:
If you do need to use client side processing:
Now a lot of my answer depends on details of the implementation, how the JIT and general compiler optimisations work, etc.
Foreach works on enumerators, not collections. But if you feed a collection to foreach, an enumerator is implicitly created. Now enumerators have two properties:
If the collection changes, the enumerator becomes invalid. Most people learn about enumerators because they ran into this issue.
There is an extra function call and set of checks for each access to the collection, so it will be a slowdown. How much is hard to say, as the optimisations and the JIT are pretty good.
So you probably want to use a for loop instead.
If you could turn the Dictionary into a collection where the primary key is used as the index, it might be a bit faster. But that has the danger of running into a lot of "dry spells" regarding data, so it depends a lot on your source data.
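If you do end up doing it client-side, here is a minimal C# sketch of the straightforward loop (the question is VB.NET, and Record stands in for the question's Class type, so the names are assumptions). Note that mutating the stored objects does not invalidate the enumerator; only adding or removing dictionary entries does.

using System.Collections.Generic;

class Record
{
    public bool Updated { get; set; }
}

static class BulkUpdate
{
    // Marks every value whose key matches the rule, in a single pass
    // and without building an intermediate query result first.
    public static void MarkOddKeys(Dictionary<long, Record> records)
    {
        foreach (KeyValuePair<long, Record> pair in records)
        {
            if (pair.Key % 2 == 1)
                pair.Value.Updated = true; // mutates the value, not the dictionary itself
        }
    }
}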
If I have a public method that returns a reference type value which is a private field of the current class, do I need to return a copy of it? In my case I need to return a List, but this method is called very often and my list holds ~100 items. The point is that if I return the same variable, everybody can modify it, but if I return a copy, the performance will degrade. In my case I'm trying to generate a sudoku table, which is not a fast procedure.
The internal class SudokuTable holds the cells with their possible values. The public class SudokuGame handles UI requests and generates/solves a SudokuTable. Is it good practice to choose performance over OOP principles? If someone wants to build another library using my SudokuTable class, they won't be aware that they can break its state by modifying the List that it returns.
Performance and object-oriented programming are not mutually exclusive - your code can be object-oriented and perform badly, etc.
In the case you state here, I don't think it would be wise to allow external parties to edit the internal state of the thing, so I would return an array or a ReadOnlyCollection of the entries. (One potential option would be to use an ObservableCollection and monitor for out-of-bounds tampering, 'handling' it accordingly, say with an exception; I'm unsure how desirable that would be.)
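A minimal sketch of the read-only approach (the member names here are assumptions, not the asker's actual API):

using System.Collections.Generic;
using System.Collections.ObjectModel;

class SudokuTable
{
    private readonly List<int> _cells = new List<int>();

    // AsReadOnly() wraps the existing list in O(1); nothing is copied,
    // but callers cannot add, remove, or replace items through the view.
    public ReadOnlyCollection<int> Cells => _cells.AsReadOnly();
}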
From there, you might consider how you expose access to these entries, trying to minimise the need for callers to get the full collection when all they need is to look up and return a specific one.
It's worth noting that an uneditable collection doesn't necessarily mean the state cannot be altered, either; if the entries are represented by a reference type rather than a value type then returning an entry leaves that open to tampering (potentially, depending on the class definition), so you might be better off with structs for the entry types.
Ultimately, without a concrete example of where you're having problems, this is all a bit subjective and theoretical at the moment. Have you tried restricting the collection? If so, how was the performance? Where were the issues? And so on.
I am trying to optimize my code and was running VS performance monitor on it.
It shows that a simple assignment of a float takes up a major chunk of computing power?? I don't understand how that is possible.
Here is the code for TagData:
public class TagData
{
public int tf;
public float tf_idf;
}
So all I am really doing is:
float tag_tfidf = td.tf_idf;
I am confused.
I'll post another theory: it might be the cache miss of the first access to members of td. A memory load takes 100-200 cycles which in this case seems to amount to about 1/3 of the total duration of the method.
Points to test this theory:
Is your data set big? I bet it is.
Are you accessing the TagData objects in random memory order? I bet they are not sequential in memory, which defeats the CPU's memory prefetcher.
Add a new line int dummy = td.tf; before the expensive line. This new line will now be the most expensive one, because it will trigger the cache miss. Find some way to do a dummy load operation that the JIT does not optimize out; maybe add all the td.tf values to a local and pass that value to GC.KeepAlive at the end of the method. That should keep the memory load in the JIT-emitted x86 (a sketch of this test follows below).
I might be wrong, but contrary to the other theories so far, mine is testable.
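The test, sketched out, assuming the question's loop iterates over term.tags (the surrounding details are guesses):

// Touch td.tf first so that load absorbs the cache miss, then check whether
// the profiler's cost moves off the tf_idf assignment.
long dummySum = 0;
foreach (TagData td in term.tags)
{
    dummySum += td.tf;             // first touch of the object: should now be the slow line
    float tag_tfidf = td.tf_idf;   // the line currently reported as expensive
    // ... rest of the loop body unchanged
}
GC.KeepAlive(dummySum);            // keeps the dummy loads from being optimized away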
Try making TagData a struct. That will make all items of term.tags sequential in memory and give you a nice performance boost.
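For illustration, the struct variant keeps the same fields; when stored in an array or List<TagData>, the items are then laid out contiguously (but note structs are copied on assignment):

public struct TagData
{
    public int tf;
    public float tf_idf;
}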
Are you using LINQ? If so, LINQ uses lazy enumeration so the first time you access the value you pulled out, it's going to be painful.
If you are using LINQ, call ToList() after your query to only pay the price once.
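For example (the actual query isn't shown in the question, so this is purely illustrative):

// Materialize the query once; later enumerations reuse the list
// instead of re-running the whole query.
var tags = term.tags.Where(t => t.tf > 0).ToList();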
It also looks like your data structure is suboptimal, but since I don't have access to your source (and probably couldn't help even if I did :) ), I can't tell you what would be better.
EDIT: As commenters have pointed out, LINQ may not be to blame; however, my reasoning is based on the fact that both foreach statements are using IEnumerable. The TagData assignment is a reference to the item in the IEnumerable's underlying collection (which may or may not have been enumerated yet). The first access of real data is the line that pulls the property from the object; the first time this happens, it may be executing the entire LINQ statement, and since profiling uses the average, the numbers may be off. The same can be said for tagScores (which I'm guessing is database-backed), whose first access is really slow and then speeds up. I wasn't pointing out the solution, just a possible problem, given my understanding of IEnumerable.
See http://odetocode.com/blogs/scott/archive/2008/10/01/lazy-linq-and-enumerable-objects.aspx
As we can see, the line next to the suspicious one takes only 0.6, i.e.:
float tag_tfidf = td.tf_idf;//29.6
string tagName =...;//0.6
I suspect this is caused by the excessive number of calls, and also note that float is a value type, meaning it is copied by value. So every time you assign it, the runtime creates a new float (Single) and initializes it by copying the value from td.tf_idf, which takes a huge amount of time.
You can see that string tagName = ...; doesn't take much, because it is copied by reference.
Edit: As the comments pointed out, I may be wrong in that respect; this might also be a bug in the profiler. Try re-profiling and see if that makes any difference.
Is there a convention for whether or not to use a property to calculate a value on call? For instance, my class contains a list of integers and I have a property Average; the average will possibly change when an integer is added, removed, or modified in the list. Does doing something like this:
private int? _ave = null;
public int Average
{
    get
    {
        if (_ave == null)
        {
            double accum = 0;
            foreach (int i in myList)
            {
                accum += i;
            }
            _ave = (int)(accum / myList.Count); // cast needed: accum is a double
            return (int)_ave;
        }
        else
        {
            return (int)_ave;
        }
    }
}
where _ave is set to null if myList is modified in a way that may change the average...
have any conventional advantage or disadvantage over a method call that computes the average?
I am basically just wondering what the conventions are for this, as I am creating a class that has specific properties that may only need to be calculated once. I like the idea of the classes that access these values using a property rather than a method (it seems more readable, IMO, to treat something like the average as a property rather than a method), but I can see where this might get convoluted, especially in making sure that _ave is set to null appropriately.
The conventions are:
If the call is going to take significantly more time than simply reading a field and copying the value in it, make it a method. Properties should be fast.
If the member represents an action or an ability of the class, make it a method.
If the call to the getter mutates state, make it a method. Properties are invoked automatically in the debugger, and it is extremely confusing to have the debugger introducing mutations in your program as you debug it.
If the call is not robust in the face of being called at unusual times, then make it a method. Properties need to continue to work when used in constructors and finalizers, for example. Again, think about the debugger; if you are debugging a constructor then it should be OK for you to examine a property in the debugger even if it has not actually been initialized yet.
If the call can fail then make it a method. Properties should not throw exceptions.
In your specific case, it is borderline. You are performing a potentially lengthy operation the first time and then caching the result, so the amortized time is likely to be very fast even if the worst-case time is slow. You are mutating state, but again, in quite a non-destructive way. It seems like you could characterize it as a property of a set rather than an "ability" of the set. I would personally be inclined to make this a method but I would not push back very hard if you had a good reason to make it a property.
Regarding your specific implementation: I would be much more inclined to use a 64 bit integer as the accumulator rather than a 64 bit double; the double only has 53 bits of integer precision compared to the 64 bits of a long.
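A sketch of that change applied to the getter above (illustrative only):

long accum = 0;                       // 64-bit integer accumulator: no precision loss while summing
foreach (int i in myList)
{
    accum += i;
}
_ave = (int)(accum / myList.Count);   // integer division, truncating like the original cast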
Microsoft's recommendations for when to use a method:
Use a method:
If the call has side effects
If it returns different values on each call
If it takes a long time to call
If the operation requires parameters (except indexers)
Use a property if the calculated value is an attribute of the object.
In your case I think a property with implicit lazy calculation would be a good choice.
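If the value really only needs to be computed once (i.e. the list no longer changes after construction), Lazy<T> gives you that implicit lazy calculation without a hand-rolled cache field. A sketch, with assumed names:

using System;
using System.Collections.Generic;
using System.Linq;

class Stats
{
    private readonly List<int> _values;
    private readonly Lazy<int> _average;

    public Stats(List<int> values)
    {
        _values = values;
        // Computed on the first access to Average, then cached; this only
        // works if _values is not modified afterwards.
        _average = new Lazy<int>(() => (int)_values.Average());
    }

    public int Average => _average.Value;
}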
Yes there is... a get accessor should not in any way modify the state of the object. The returned value could be calculated of course, and you might have a ton of code in there. But simply accessing a value should not affect the state of the containing instance at all.
In this particular case, why not calculate everything upon construction of the class instance instead? Or provide a dedicated method to force the class to do so.
Now I suppose there might be very specific situations where that sort of behavior is OK. This might be one of those. But without seeing the rest of the code (and the way it is used), it's impossible to tell.
I'm creating a repository in EF4. In one of the methods, a password and username are used to verify a user. The method returns a count of users, so 0 means they don't exist and 1 means they do. Would it make much of a difference if I just returned a user object and checked it for null?
Technically the most efficient way would probably be to use the Any() extension method. If you return an object, there is the cost of filling that object. If you return a count, there is the cost of going through every record (after the where clause has been applied) and counting them. Any() should translate to EXISTS in SQL, and therefore SQL Server can stop as soon as it finds the first record.
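For instance, something along these lines (the entity and property names are assumptions, not the repository's actual API):

// Translates to an EXISTS query, so the server can stop at the first match.
bool userExists = context.Users
    .Any(u => u.Username == username && u.PasswordHash == passwordHash);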
Ultimately though, I agree with the others, this isn't a place you want to start optimizing right away. Donald Knuth probably has the best quote about this:
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil".
For instance, let's say you have this method return a bool and you use the Any() method. Later in the request, you might need to pull the user object out of the database (this could be something you end up doing a lot). Now, by optimizing early, you've actually increased the number of calls to the database.
HTH
Well, the option with Any() is going to be better, because EF has a high cost of materialization and change tracking for an object, and if that object happens to have a lot of properties, you should definitely consider using Any().
The second version would be better in terms of design. In terms of micro-efficiency it shouldn't matter.
Okay, if you want to know the truth: the difference between the two methods is going to be so negligible that it won't matter which one you choose. Choose whichever makes your code easier to understand and read. Oftentimes people worry about performance in all the wrong places.
I agree with Armen: return the object and check for null. Very simple, and it's easy to understand what is going on.
If you don't need any data from the 'user' table after you verify that a valid user/password combo exists, then either method will work (and performance won't matter).
On the other hand, if once you verify a valid username/password combo you plan on making a second call to get the user details, then clearly returning the object in the first place (and checking it for null to verify existence) is the more efficient strategy, in my opinion.