Passing a complex object vs passing its property - C#

I know this is a basic question, but I wanted to understand the key differences between the two approaches and how they affect performance.
Below is a C# example.
Suppose I have a Person class with properties like
public class Person
{
public int PersonId {get;set;}
public string Name {get;set;}
...
...
}
and I have defined somewhere
List<Person> myFamily; // I later initialize it to contain all my family members.
Now I have 2 functions fun1 and fun2
List<Person> fun1(List<Person> myFamily)
{
... // some logic happens here and I get back a smaller list of Person objects.
...
}
somewhere else
List<Person> selectedPersons = fun1(myFamily);
VS
List<int> fun2(List<int> myFamilyPersonIds)
{
... // Here same logic occurs as fun1 but it only needs personID to perform it.
...
}
somewhere else
List<int> selectedPersonIds = fun2(myFamilyPersonIds);
List<Person> selectedPersons = myFamily.Where(a => selectedPersonIds.Contains(a.PersonId)).ToList();
I want to know in what ways this affects performance.
Tips and suggestions are also welcome.

The only way to know the correct answer is to use a tool to profile the code and see where your program is really spending its time. Before that, you're just guessing and doing premature optimization.
However, assuming you already have these objects constructed, I tend to prefer the fun1() option. This is because it's passing around references to the same objects in memory. The fun2() option needs to make copies of the integer IDs. As a reference is about the same size as an integer, I'd expect copying integers to be about the same amount of work as copying references. That part is a wash.
However, by staying with references you can save the later step of finding the whole object based on the ID. That should simplify your code, making it easier to read and maintain, and save work for the computer (improve performance), too.
Also, imagine for a moment that your property were something larger than an integer... say, a string. In this case, passing the object could be a huge performance win. But again, what I expect is just a naive guess without results from a real profiling tool.
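To make that concrete, here is a rough sketch of what each call site ends up doing (the member names mirror the question; System.Linq is assumed to be in scope). The extra join in the second option is the step that keeping references lets you skip:
// Option 1: pass the objects - only references are copied, and the result
// already contains the Person instances you want.
List<Person> selectedPersons = fun1(myFamily);

// Option 2: pass only the ids - copying ints costs about the same as copying
// references, but you pay for an extra join afterwards.
List<int> familyIds = myFamily.Select(p => p.PersonId).ToList();
List<int> selectedIds = fun2(familyIds);
List<Person> selected = myFamily
    .Where(p => selectedIds.Contains(p.PersonId))   // O(n * m) with List<int>.Contains
    .ToList();

// If you must go the id route, a HashSet<int> makes each lookup O(1):
var idSet = new HashSet<int>(selectedIds);
List<Person> selectedFaster = myFamily.Where(p => idSet.Contains(p.PersonId)).ToList();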

Related

Programmatically check for a change in a class in C#

Is there a way to check for the size of a class in C#?
My reason for asking is:
I have a routine that stores a class's data in a file, and a different routine that loads this object (class) from that same file. Each attribute is stored in a specific order, and if you change this class you need to be reminded that these export/import routines need changing.
An example in C++ (no matter how clumsy or bad programming this might be) would be
the following:
#define PERSON_CLASS_SIZE 8
class Person
{
char *firstName;
};
...
bool ExportPerson(Person p)
{
if (sizeof(Person) != PERSON_CLASS_SIZE )
{
CatastrophicAlert("You have changed the Person class and not fixed this export routine!");
}
// ... export each field in the agreed order ...
return true;
}
Thus, before compile time, you need to know the size of Person and update the export/import routines with this size accordingly.
Is there a way to do something similar to this in C#, or are there other ways of "making sure" a different developer changes import/export routines if he changes a class.
... Apart from the obvious "just comment this in the class, this guarantees that a developer never screws things up"-answer.
Thanks in advance.
Each attribute is stored in a specific order, and if you change this class you need to be reminded that these export/import routines need changing.
It sounds like you're writing your own serialization mechanism. If that's the case, you should probably include some sort of "fingerprint" of the expected properties in the right order, and validate that at read time. You can then include the current fingerprint in a unit test, which will then fail if a property is added. The appropriate action can then be taken (e.g. migrating existing data) and the unit test updated.
Just checking the size of the class certainly wouldn't find all errors - if you added one property and deleted one of the same size in the same change, you could break data without noticing it.
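As an illustration of the fingerprint idea, here is a minimal sketch, assuming reflection over public instance properties is acceptable (SchemaFingerprint and the expected string are hypothetical names):
using System;
using System.Linq;
using System.Reflection;

static class SchemaFingerprint
{
    // Builds a deterministic summary of a type's public properties.
    // Properties are sorted by name because reflection does not guarantee declaration order.
    public static string Compute(Type type)
    {
        return string.Join(";",
            type.GetProperties(BindingFlags.Public | BindingFlags.Instance)
                .OrderBy(p => p.Name)
                .Select(p => p.Name + ":" + p.PropertyType.Name)
                .ToArray());
    }
}

// In a unit test, fail loudly when the class shape changes:
// Assert.AreEqual("Name:String;PersonId:Int32", SchemaFingerprint.Compute(typeof(Person)));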
Apart from the fact that this is probably not the best way to achieve what you need,
I think the fastest way is to use Cecil (Mono.Cecil). You can get the IL body of the entire class.

In memory representation of large data

Currently, I am working on a project where I need to bring GBs of data onto the client machine to do some task, and the task needs the whole data set because it performs analysis on it to support a decision-making process.
So the question is: what are the best practices and suitable approaches to manage that much data in memory without hampering the performance of the client machine and the application?
Note: at application start-up we can spend time bringing data from the database to the client machine; that's totally acceptable in our case. But once the data is loaded into the application, performance is very important.
This is a little hard to answer without a problem statement, i.e. what problems you are currently facing, but the following is just some thoughts, based on some recent experiences we had in a similar scenario. It is, however, a lot of work to change to this type of model - so it also depends how much you can invest trying to "fix" it, and I can make no promise that "your problems" are the same as "our problems", if you see what I mean. So don't get cross if the following approach doesn't work for you!
Loading that much data into memory is always going to have some impact, however, I think I see what you are doing...
When loading that much data naively, you are going to have many (millions?) of objects and a similar-or-greater number of references. You're obviously going to want to be using x64, so the references will add up - but in terms of performance the biggest problem is going to be garbage collection. You have a lot of objects that can never be collected, but the GC is going to know that you're using a ton of memory, and is going to try anyway periodically. This is something I looked at in more detail here; the graph from that write-up (not reproduced here) shows the impact - in particular, the "spikes" are all GC killing performance.
For this scenario (a huge amount of data loaded, never released), we switched to using structs, i.e. loading the data into:
struct Foo {
private readonly int id;
private readonly double value;
public Foo(int id, double value) {
this.id = id;
this.value = value;
}
public int Id {get{return id;}}
public double Value {get{return value;}}
}
and stored those directly in arrays (not lists):
Foo[] foos = ...
The significance of that is that some of these structs are quite big, and we didn't want them being copied around lots of times on the stack; with an array you can do:
private void SomeMethod(ref Foo foo) {
if(foo.Value == ...) {blah blah blah}
}
// call ^^^
int index = 17;
SomeMethod(ref foos[index]);
Note that we've passed the object directly - it was never copied; foo.Value is actually looking directly inside the array. The tricky bit starts when you need relationships between objects. You can't store a reference to another Foo here, because Foo is a struct. What you can do, though, is store the index into the array. For example:
struct Customer {
... more not shown
public int FooIndex { get { return fooIndex; } }
}
Not quite as convenient as customer.Foo, but the following works nicely:
Foo foo = foos[customer.FooIndex];
// or, when passing to a method, SomeMethod(ref foos[customer.FooIndex]);
Key points:
we're now using half the size for "references" (an int is 4 bytes; a reference on x64 is 8 bytes)
we don't have several-million object headers in memory
we don't have a huge object graph for GC to look at; only a small number of arrays that GC can look at incredibly quickly
but it is a little less convenient to work with, and needs some initial processing when loading
additional notes:
strings are a killer; if you have millions of strings, that is problematic. At a minimum, if you have strings that are repeated, make sure you do some custom interning (not string.Intern - that would be bad) to ensure you only have one instance of each repeated value, rather than 800,000 strings with the same contents (a minimal interning sketch follows after these notes)
if you have repeated data of finite length, rather than sub-lists/arrays, you might consider a fixed array; this requires unsafe code, but avoids another myriad of objects and references
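For the custom interning mentioned above, a minimal sketch (assuming single-threaded loading; StringPool is a hypothetical name) is just a dictionary that hands back the first instance seen for each distinct value:
using System.Collections.Generic;

class StringPool
{
    private readonly Dictionary<string, string> _pool = new Dictionary<string, string>();

    // Returns a single shared instance per distinct value. Unlike string.Intern,
    // the pooled strings die with the pool rather than living for the process lifetime.
    public string Intern(string value)
    {
        if (value == null) return null;
        string existing;
        if (_pool.TryGetValue(value, out existing)) return existing;
        _pool.Add(value, value);
        return value;
    }
}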
As an additional footnote, with that volume of data, you should think very seriously about your serialization protocols, i.e. how you're sending the data down the wire. I would strongly suggest staying far away from things like XmlSerializer, DataContractSerializer or BinaryFormatter. If you want pointers on this subject, let me know.

Is there a LinkedList collection that supports dictionary type operations

I was recently profiling an application trying to work out why certain operations were extremely slow. One of the classes in my application is a collection based on LinkedList. Here's a basic outline, showing just a couple of methods and some fluff removed:
public class LinkInfoCollection : PropertyNotificationObject, IEnumerable<LinkInfo>
{
private LinkedList<LinkInfo> _items;
public LinkInfoCollection()
{
_items = new LinkedList<LinkInfo>();
}
public void Add(LinkInfo item)
{
_items.AddLast(item);
}
public LinkInfo this[Guid id]
{ get { return _items.SingleOrDefault(i => i.Id == id); } }
}
The collection is used to store hyperlinks (represented by the LinkInfo class) in a single list. However, each hyperlink also has a list of hyperlinks which point to it, and a list of hyperlinks which it points to. Basically, it's a navigation map of a website. As this means you can have infinite recursion when links point back to each other, I implemented this as a linked list - as I understand it, this means that for every hyperlink, no matter how many times it is referenced by another hyperlink, there is only ever one copy of the object.
The ID property in the above example is a GUID.
With that long-winded description out of the way, my problem is simple - according to the profiler, when constructing this map for a fairly small website, the indexer referred to above is called no fewer than 27,906 times, which is an extraordinary amount. I still need to work out if it's really necessary to call it that many times, but at the same time I would like to know if there's a more efficient way of doing the indexer, as this is the primary bottleneck identified by the profiler (also assuming it isn't lying!). I still need the linked-list behaviour, as I certainly don't want more than one copy of these hyperlinks floating around killing my memory, but I also need to be able to access them by a unique key.
Does anyone have any advice to offer on improving the performance of this indexer? I also have another indexer which uses a URI rather than a GUID, but this is less problematic as the building of incoming/outgoing links is done by GUID.
Thanks;
Richard Moss
You should use a Dictionary<Guid, LinkInfo>.
You don't need to use LinkedList in order to have only one copy of each LinkInfo in memory. Remember that LinkInfo is a managed reference type, and so you can place it in any collection, and it'll just be a reference to the object that gets placed in the list, not a copy of the object itself.
That said, I'd implement the LinkInfo class as containing two lists of Guids: one for the things this links to, one for the things linking to this. I'd have just one Dictionary<Guid, LinkInfo> to store all the links. Dictionary is a very fast lookup, I think that'll help with your performance.
The fact that this[] is getting called 27,000 times doesn't seem like a big deal to me, but what's making it show up in your profiler is probably the SingleOrDefault call on the LinkedList. Linked lists are best for situations where you need fast insertions & removals, particularly in the middle of the list. For quick lookups, which is probably more important here, let the Dictionary do its work with hash tables.
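A rough sketch of that shape, in the question's terms (the LinkInfo members beyond Id are assumptions based on the description, and PropertyNotificationObject is omitted for brevity):
using System;
using System.Collections;
using System.Collections.Generic;

public class LinkInfo
{
    public Guid Id { get; set; }
    // Relationships stored as ids, resolved through the collection when needed.
    public List<Guid> LinksTo { get; private set; }
    public List<Guid> LinkedFrom { get; private set; }

    public LinkInfo()
    {
        LinksTo = new List<Guid>();
        LinkedFrom = new List<Guid>();
    }
}

public class LinkInfoCollection : IEnumerable<LinkInfo>
{
    private readonly Dictionary<Guid, LinkInfo> _items = new Dictionary<Guid, LinkInfo>();

    public void Add(LinkInfo item)
    {
        _items.Add(item.Id, item);
    }

    // O(1) hash lookup instead of the O(n) SingleOrDefault scan over a LinkedList.
    public LinkInfo this[Guid id]
    {
        get
        {
            LinkInfo item;
            return _items.TryGetValue(id, out item) ? item : null;
        }
    }

    public IEnumerator<LinkInfo> GetEnumerator() { return _items.Values.GetEnumerator(); }
    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}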

Is it a good idea to create a custom type for the primary key of each data table?

We have a lot of code that passes about “Ids” of data rows; these are mostly ints or guids. I could make this code safer by creating a different struct for the id of each database table. Then the type checker will help to find cases when the wrong ID is passed.
E.g. the Person table has a column called PersonId and we have code like:
DeletePerson(int personId)
DeleteCar(int carId)
Would it be better to have:
struct PersonId
{
private int id;
// GetHashCode etc....
}
DeletePerson(PersonId personId)
DeleteCar(CarId carId)
Has anyone got real-life experience of doing this?
Is it worth the overhead?
Or is it more pain than it is worth?
(It would also make it easier to change the primary key's data type in the database; that is why I thought of this idea in the first place.)
Please don’t say "use an ORM" or suggest some other big change to the system design, as I know an ORM would be a better option, but that is not within my power at present. However, I can make minor changes like the above to the module I am working on at present.
Update:
Note this is not a web application and the Ids are kept in memory and passed about with WCF, so there is no conversion to/from strings at the edge. There is no reason that the WCF interface can’t use the PersonId type etc. The PersonId type etc. could even be used in the WPF/WinForms UI code.
The only inherently "untyped" bit of the system is the database.
This seems to be down to the cost/benefit of spending time writing code that the compiler can check better, or spending the time writing more unit tests. I am coming down more on the side of spending the time on testing, as I would like to see at least some unit tests in the code base.
It's hard to see how it could be worth it: I recommend doing it only as a last resort and only if people are actually mixing identifiers during development or reporting difficulty keeping them straight.
In web applications in particular it won't even offer the safety you're hoping for: typically you'll be converting strings into integers anyway. There are just too many cases where you'll find yourself writing silly code like this:
int personId;
if (Int32.TryParse(Request["personId"], out personId)) {
this.person = this.PersonRepository.Get(new PersonId(personId));
}
Dealing with complex state in memory certainly improves the case for strongly-typed IDs, but I think Arthur's idea is even better: to avoid confusion, demand an entity instance instead of an identifier. In some situations, performance and memory considerations could make that impractical, but even those should be rare enough that code review would be just as effective without the negative side-effects (quite the reverse!).
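In the question's terms, the "entity instance" idea is simply this (a sketch, not a prescription):
// Instead of untyped ids that can be swapped by accident:
void DeletePerson(int personId) { /* ... */ }
void DeleteCar(int carId) { /* ... */ }

// ...demand the entities themselves, so the compiler catches a mix-up:
void DeletePerson(Person person) { /* ... */ }
void DeleteCar(Car car) { /* ... */ }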
I've worked on a system that did this, and it didn't really provide any value. We didn't have ambiguities like the ones you're describing, and in terms of future-proofing, it made it slightly harder to implement new features without any payoff. (No ID's data type changed in two years, at any rate - it could certainly happen at some point, but as far as I know, the return on investment for that is currently negative.)
I wouldn't make a special id for this. This is mostly a testing issue. You can test the code and make sure it does what it is supposed to.
You can create a standard way of doing things in your system that helps future maintenance (similar to what you mention) by passing in the whole object to be manipulated. Of course, if you named your parameter (int personID) and had documentation, then any non-malicious programmer should be able to use the code effectively when calling that method. Passing a whole object will do the type matching that you are looking for, and that should be enough of a standardized way.
I just see having a special structure made to guard against this as adding more work for little benefit. Even if you did this, someone could come along and find a convenient way to make a 'helper' method and bypass whatever structure you put in place anyway so it really isn't a guarantee.
You can just opt for GUIDs, like you suggested yourself. Then, you won't have to worry about passing a person ID of "42" to DeleteCar() and accidentally delete the car with ID of 42. GUIDs are unique; if you pass a person GUID to DeleteCar in your code because of a programming typo, that GUID will not be a PK of any car in the database.
You could create a simple Id class which can help differentiate in code between the two:
public class Id<T>
{
private int RawValue
{
get;
set;
}
public Id(int value)
{
this.RawValue = value;
}
public static explicit operator int (Id<T> id) { return id.RawValue; }
// this cast is optional and can be excluded for further strictness
public static implicit operator Id<T> (int value) { return new Id<T>(value); }
}
Used like so:
class SomeClass
{
public Id<Person> PersonId { get; set; }
public Id<Car> CarId { get; set; }
}
Assuming your values would only be retrieved from the database, unless you explicitly cast the value to an integer, it is not possible to use the two in each other's place.
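For example, assuming DeletePerson and DeleteCar are changed to take Id<Person> and Id<Car> respectively, a mix-up no longer compiles:
Id<Person> personId = 42;   // allowed via the implicit int -> Id<T> operator
Id<Car> carId = 7;

DeleteCar(carId);           // fine
// DeleteCar(personId);     // compile-time error: cannot convert Id<Person> to Id<Car>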
I don't see much value in custom checking in this case. You might want to beef up your testing suite to check that two things are happening:
Your data access code always works as you expect (i.e., you aren't loading inconsistent Key information into your classes and getting misuse because of that).
That your "round trip" code is working as expected (i.e., that loading a record, making a change and saving it back isn't somehow corrupting your business logic objects).
Having a data access (and business logic) layer you can trust is crucial to being able to address the bigger-picture problems you will encounter when attempting to implement the actual business requirements. If your data layer is unreliable, you will spend a lot of effort tracking (or worse, working around) problems at that level that surface when you put load on the subsystem.
If instead your data access code is robust in the face of incorrect usage (what your test suite should be proving to you) then you can relax a bit on the higher levels and trust they will throw exceptions (or however you are dealing with it) when abused.
The reason you hear people suggesting an ORM is that many of these issues are dealt with in a reliable way by such tools. If your implementation is far enough along that such a switch would be painful, just keep in mind that your low-level data access layer needs to be as robust as a good ORM if you really want to be able to trust (and thus, to a certain extent, forget about) your data access.
Instead of custom validation, your testing suite could inject code (via dependency injection) that does robust tests of your Keys (hitting the database to verify each change) as the tests run and that injects production code that omits or restricts such tests for performance reasons. Your data layer will throw errors on failed keys (if you have your foreign keys set up correctly there) so you should also be able to handle those exceptions.
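A hypothetical sketch of that injection idea (the interface and class names are made up for illustration): the test environment wires in a validator that hits the database, while production wires in a cheap one.
public interface IKeyValidator
{
    void AssertPersonExists(int personId);
}

// Production: skip or restrict the check for performance.
public class NoOpKeyValidator : IKeyValidator
{
    public void AssertPersonExists(int personId) { }
}

// Tests: verify the key against the database and fail loudly if it is missing.
public class DatabaseKeyValidator : IKeyValidator
{
    public void AssertPersonExists(int personId)
    {
        // e.g. SELECT COUNT(*) FROM Person WHERE PersonId = @personId; throw if zero.
    }
}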
My gut says this just isn't worth the hassle. My first question to you would be whether you have actually found bugs where the wrong int was being passed (a Car ID instead of a Person ID, in your example). If so, it is probably more a symptom of a wider architecture problem: your domain objects have too much coupling and are passing too many arguments around in method parameters rather than acting on internal variables.

Can I detect whether I've been given a new object as a parameter?

Short Version
For those who don't have the time to read my reasoning for this question below:
Is there any way to enforce a policy of "new objects only" or "existing objects only" for a method's parameters?
Long Version
There are plenty of methods which take objects as parameters, and it doesn't matter whether the method has the object "all to itself" or not. For instance:
var people = new List<Person>();
Person bob = new Person("Bob");
people.Add(bob);
people.Add(new Person("Larry"));
Here the List<Person>.Add method has taken an "existing" Person (Bob) as well as a "new" Person (Larry), and the list contains both items. Bob can be accessed as either bob or people[0]. Larry can be accessed as people[1] and, if desired, cached and accessed as larry (or whatever) thereafter.
OK, fine. But sometimes a method really shouldn't be passed a new object. Take, for example, Array.Sort<T>. The following doesn't make a whole lot of sense:
Array.Sort<int>(new int[] {5, 6, 3, 7, 2, 1});
All the above code does is take a new array, sort it, and then forget it (as nothing references it after Array.Sort<int> exits, the sorted array will therefore be garbage collected, if I'm not mistaken). So Array.Sort<T> expects an "existing" array as its argument.
There are conceivably other methods which may expect "new" objects (though I would generally think that to have such an expectation would be a design mistake). An imperfect example would be this:
DataTable firstTable = myDataSet.Tables["FirstTable"];
DataTable secondTable = myDataSet.Tables["SecondTable"];
firstTable.Rows.Add(secondTable.Rows[0]);
As I said, this isn't a great example, since DataRowCollection.Add doesn't actually expect a new DataRow, exactly; but it does expect a DataRow that doesn't already belong to a DataTable. So the last line in the code above won't work; it needs to be:
firstTable.ImportRow(secondTable.Rows[0]);
Anyway, this is a lot of setup for my question, which is: is there any way to enforce a policy of "new objects only" or "existing objects only" for a method's parameters, either in its definition (perhaps by some custom attributes I'm not aware of) or within the method itself (perhaps by reflection, though I'd probably shy away from this even if it were available)?
If not, any interesting ideas as to how to possibly accomplish this would be more than welcome. For instance I suppose if there were some way to get the GC's reference count for a given object, you could tell right away at the start of a method whether you've received a new object or not (assuming you're dealing with reference types, of course--which is the only scenario to which this question is relevant anyway).
EDIT:
The longer version gets longer.
All right, suppose I have some method that I want to optionally accept a TextWriter to output its progress or what-have-you:
static void TryDoSomething(TextWriter output) {
// do something...
if (output != null)
output.WriteLine("Did something...");
// do something else...
if (output != null)
output.WriteLine("Did something else...");
// etc. etc.
if (output != null)
// do I call output.Close() or not?
}
static void TryDoSomething() {
TryDoSomething(null);
}
Now, let's consider two different ways I could call this method:
string path = GetFilePath();
using (StreamWriter writer = new StreamWriter(path)) {
TryDoSomething(writer);
// do more things with writer
}
OR:
TryDoSomething(new StreamWriter(path));
Hmm... it would seem that this poses a problem, doesn't it? I've constructed a StreamWriter, which implements IDisposable, but TryDoSomething isn't going to presume to know whether it has exclusive access to its output argument or not. So the object either gets disposed prematurely (in the first case), or doesn't get disposed at all (in the second case).
I'm not saying this would be a great design, necessarily. Perhaps Josh Stodola is right and this is just a bad idea from the start. Anyway, I asked the question mainly because I was just curious if such a thing were possible. Looks like the answer is: not really.
No, basically.
There's really no difference between:
var x = new ...;
Foo(x);
and
Foo(new ...);
and indeed sometimes you might convert between the two for debugging purposes.
Note that in the DataRow/DataTable example, there's an alternative approach though - that DataRow can know its parent as part of its state. That's not the same thing as being "new" or not - you could have a "detach" operation for example. Defining conditions in terms of the genuine hard-and-fast state of the object makes a lot more sense than woolly terms such as "new".
Yes, there is a way to do this.
Sort of.
If you make your parameter a ref parameter, you'll have to have an existing variable as your argument. You can't do something like this:
DoSomething(ref new Customer());
If you do, you'll get the error "A ref or out argument must be an assignable variable."
Of course, using ref has other implications. However, if you're the one writing the method, you don't need to worry about them. As long as you don't reassign the ref parameter inside the method, it won't make any difference whether you use ref or not.
I'm not saying it's good style, necessarily. You shouldn't use ref or out unless you really, really need to and have no other way to do what you're doing. But using ref will make what you want to do work.
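A minimal sketch of the ref trick (Customer and DoSomething are hypothetical names):
class Customer
{
    public string Name { get; set; }
}

static void DoSomething(ref Customer customer)
{
    // As long as customer is never reassigned, ref behaves like a normal parameter here.
    Console.WriteLine(customer.Name);
}

static void Demo()
{
    Customer existing = new Customer { Name = "Bob" };
    DoSomething(ref existing);          // fine: an assignable variable
    // DoSomething(ref new Customer()); // error: "A ref or out argument must be an assignable variable."
}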
No. And if there is some reason that you need to do this, your code has improper architecture.
Short answer - no there isn't
In the vast majority of cases I usually find that the issues that you've listed above don't really matter all that much. When they do you could overload a method so that you can accept something else as a parameter instead of the object you are worried about sharing.
// For example create a method that allows you to do this:
people.Add("Larry");
// Instead of this:
people.Add(new Person("Larry"));
// The new method might look a little like this:
public void Add(string name)
{
Person person = new Person(name);
this.Add(person); // This method could be private if necessary
}
I can think of a way to do this, but I would definitely not recommend this. Just for argument's sake.
What does it mean for an object to be a "new" object? It means there is only one reference keeping it alive. An "existing" object would have more than one reference to it.
With this in mind, look at the following code:
class Program
{
static void Main(string[] args)
{
object o = new object();
Console.WriteLine(IsExistingObject(o));
Console.WriteLine(IsExistingObject(new object()));
o.ToString(); // Just something to simulate further usage of o. If we didn't do this, in a release build, o would be collected by the GC.Collect call in IsExistingObject. (not in a Debug build)
}
public static bool IsExistingObject(object o)
{
var oRef = new WeakReference(o);
#if DEBUG
o = null; // In Debug, we need to set o to null. This is not necessary in a release build.
#endif
GC.Collect();
GC.WaitForPendingFinalizers();
return oRef.IsAlive;
}
}
This prints True on the first line, False on the second.
But again, please do not use this in your code.
Let me rewrite your question to something shorter.
Is there any way, in my method, which takes an object as an argument, to know if this object will ever be used outside of my method?
And the short answer to that is: No.
Let me venture an opinion at this point: There should not be any such mechanism either.
This would complicate method calls all over the place.
If there were a way, inside a method call, to tell whether the object I'm given would really be used afterwards or not, then it's a signal to me, as the developer of that method, to take that into account.
Basically, you'd see this type of code all over the place (hypothetical, since it isn't available/supported:)
if (ReferenceCount(obj) == 1) return; // only reference is the one we have
My opinion is this: If the code that calls your method isn't going to use the object for anything, and there are no side-effects outside of modifying the object, then that code should not exist to begin with.
It's like code that looks like this:
1 + 2;
What does this code do? Well, depending on the C/C++ compiler, it might compile into something that evaluates 1+2. But then what, where is the result stored? Do you use it for anything? No? Then why is that code part of your source code to begin with?
Of course, you could argue that the code is actually a+b;, and the purpose is to ensure that the evaluation of a+b isn't going to throw an exception denoting overflow, but such a case is so diminishingly rare that a special case for it would just mask real problems, and it would be really simple to fix by just assigning it to a temporary variable.
In any case, for any feature in any programming language and/or runtime and/or environment, where a feature isn't available, the reasons for why it isn't available are:
It wasn't designed properly
It wasn't specified properly
It wasn't implemented properly
It wasn't tested properly
It wasn't documented properly
It wasn't prioritized above competing features
All of these are required to get a feature to appear in version X of application Y, be it C# 4.0 or MS Works 7.0.
Nope, there's no way of knowing.
All that gets passed in is the object reference. Whether it is 'newed' in-situ, or is sourced from an array, the method in question has no way of knowing how the parameters being passed in have been instantiated and/or where.
One way to know whether an object passed to a function (or method) was created right before the call is to have the object carry a property that its constructor initializes with a timestamp obtained from a system function; by looking at that property, it would be possible to decide whether the object is freshly created.
Frankly, I would not use such a method because:
I don't see any reason why the code should know whether the passed parameter is an object that was just created or one created at an earlier moment.
The method I suggest depends on a system function that might not be present on some systems, or that could be less reliable.
With modern CPUs, which are far faster than the CPUs used 10 years ago, it would be hard to pick the right threshold value for deciding whether an object has been freshly created or not.
The other solution would be to use an object property that is set to one value by the object's constructor and to a different value by every other method of the object.
In that case the problem would be forgetting to add the code that changes that property in each method.
Once again I would ask myself, "Is there really a need to do this?"
As a possible partial solution, if you only ever want one instance of an object to be consumed by a method, you could look at a Singleton. That way the method in question could not be handed another instance if one existed already.
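A conventional C# singleton, for reference (Navigator is a hypothetical class name); the private constructor means callers can never construct a second, freshly created instance to pass in:
public sealed class Navigator
{
    private static readonly Navigator instance = new Navigator();

    public static Navigator Instance
    {
        get { return instance; }
    }

    private Navigator() { }   // no "new Navigator()" possible at call sites
}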
