In memory representation of large data - c#

Currently, I am working on a project where I need to bring GBs of data on to client machine to do some task and the task needs whole data as it do some analysis on the data and helps in decision making process.
so the question is, what are the best practices and suitable approach to manage that much amount of data into memory without hampering the performance of client machine and application.
note: at the time of application loading, we can spend time to bring data from database to client machine, that's totally acceptable in our case. but once the data is loaded into application at start up, performance is very important.

This is a little hard to answer without a problem statement, i.e. what problems you are currently facing, but the following is just some thoughts, based on some recent experiences we had in a similar scenario. It is, however, a lot of work to change to this type of model - so it also depends how much you can invest trying to "fix" it, and I can make no promise that "your problems" are the same as "our problems", if you see what I mean. So don't get cross if the following approach doesn't work for you!
Loading that much data into memory is always going to have some impact, however, I think I see what you are doing...
When loading that much data naively, you are going to have many (millions?) of objects and a similar-or-greater number of references. You're obviously going to want to be using x64, so the references will add up - but in terms of performance the biggesst problem is going to be garbage collection. You have a lot of objects that can never be collected, but the GC is going to know that you're using a ton of memory, and is going to try anyway periodically. This is something I looked at in more detail here, but the following graph shows the impact - in particular, those "spikes" are all GC killing performance:
For this scenario (a huge amount of data loaded, never released), we switched to using structs, i.e. loading the data into:
struct Foo {
private readonly int id;
private readonly double value;
public Foo(int id, double value) {
this.id = id;
this.value = value;
}
public int Id {get{return id;}}
public double Value {get{return value;}}
}
and stored those directly in arrays (not lists):
Foo[] foos = ...
the significance of that is that because some of these structs are quite big, we didn't want them copying themselves lots of times on the stack, but with an array you can do:
private void SomeMethod(ref Foo foo) {
if(foo.Value == ...) {blah blah blah}
}
// call ^^^
int index = 17;
SomeMethod(ref foos[index]);
Note that we've passed the object directly - it was never copied; foo.Value is actually looking directly inside the array. The tricky bit starts when you need relationships between objects. You can't store a reference here, as it is a struct, and you can't store that. What you can do, though, is store the index (into the array). For example:
struct Customer {
... more not shown
public int FooIndex { get { return fooIndex; } }
}
Not quite as convenient as customer.Foo, but the following works nicely:
Foo foo = foos[customer.FooIndex];
// or, when passing to a method, SomeMethod(ref foos[customer.FooIndex]);
Key points:
we're now using half the size for "references" (an int is 4 bytes; a reference on x64 is 8 bytes)
we don't have several-million object headers in memory
we don't have a huge object graph for GC to look at; only a small number of arrays that GC can look at incredibly quickly
but it is a little less convenient to work with, and needs some initial processing when loading
additional notes:
strings are a killer; if you have millions of strings, then that is problematic; at a minimum, if you have strings that are repeated, make sure you do some custom interning (not string.Intern, that would be bad) to ensure you only have one instance of each repeated value, rather than 800,000 strings with the same contents
if you have repeated data of finite length, rather than sub-lists/arrays, you might consider a fixed array; this requires unsafe code, but avoids another myriad of objects and references
As an additional footnote, with that volume of data, you should think very seriously about your serialization protocols, i.e. how you're sending the data down the wire. I would strongly suggest staying far away from things like XmlSerializer, DataContractSerializer or BinaryFormatter. If you want pointers on this subject, let me know.

Related

C# - how does variable scope and disposal impact processing efficiency?

I was having a discussion with a colleague the other day about this hypothetical situation. Consider this pseudocode:
public void Main()
{
MyDto dto = Repository.GetDto();
foreach(var row in dto.Rows)
{
ProcessStrings(row);
}
}
public void ProcessStrings(DataRow row)
{
string string1 = GetStringFromDataRow(row, 1);
string string2 = GetStringFromDataRow(row, 2);
// do something with the strings
}
Then this functionally identical alternative:
public void Main()
{
string1 = null;
string2 = null,
MyDto dto = Repository.GetDto();
foreach(var row in dto.Rows)
{
ProcessStrings(row, string1, string2)
}
}
public void ProcessStrings(DataRow row, string string1, string string2)
{
string1 = GetStringFromDataRow(row, 1);
string2 = GetStringFromDataRow(row, 2);
// do something with the strings
}
How will these differ in processing when running the compiled code? Are we right in thinking the second version is marginally more efficient because the string variables will take up less memory and only be disposed once, whereas in the first version, they're disposed of on each pass of the loop?
Would it make any difference if the strings in the second version were passed by ref or as out parameters?
When you're dealing with "marginally more efficient" level of optimizations you risk not seeing the whole picture and end up being "marginally less efficient".
This answer here risks the same thing, but with that caveat, let's look at the hypothesis:
Storing a string into a variable creates a new instance of the string
No, not at all. A string is an object, what you're storing in the variable is a reference to that object. On 32-bit systems this reference is 4 bytes in size, on 64-bit it is 8. Nothing more, nothing less. Moving 4/8 bytes around is overhead that you're not really going to notice a lot.
So neither of the two examples, with the very little information we have about the makings of the methods being called, creates more or less strings than the other so on this count they're equivalent.
So what is different?
Well in one example you're storing the two string references into local variables. This is most likely going to be cpu registers. Could be memory on the stack. Hard to say, depends on the rest of the code. Does it matter? Highly unlikely.
In the other example you're passing in two parameters as null and then reusing those parameters locally. These parameters can be passed as cpu registers or stack memory. Same as the other. Did it matter? Not at all.
So most likely there is going to be absolutely no difference at all.
Note one thing, you're mentioning "disposal". This term is reserved for the usage of objects implementing IDisposable and then the act of disposing of these by calling IDisposable.Dispose on those objects. Strings are not such objects, this is not relevant to this question.
If, instead, by disposal you mean "garbage collection", then since I already established that neither of the two examples creates more or less objects than the others due to the differences you asked about, this is also irrelevant.
This is not important, however. It isn't important what you or I or your colleague thinks is going to have an effect. Knowing is quite different, which leads me to...
The real tip I can give about optimization:
Measure
Measure
Measure
Understand
Verify that you understand it correctly
Change, if possible
You measure, use a profiler to find the real bottlenecks and real time spenders in your code, then understand why those are bottlenecks, then ensure your understanding is correct, then you can see if you can change it.
In your code I will venture a guess that if you were to profile your program you would find that those two examples will have absolutely no effect whatsoever on the running time. If they do have effect it is going to be on order of nanoseconds. Most likely, the very act of looking at the profiler results will give you one or more "huh, that's odd" realizations about your program, and you'll find bottlenecks that are far bigger fish than the variables in play here.
In both of your alternatives, GetStringFromDataRow creates new string every time. Whether you store a reference to this string in a local variable or in argument parameter variable (which is essentially not much different from local variable in your case) does not matter. Imagine you even not assigned result of GetStringFromDataRow to any variable - instance of string is still created and stored somewhere in memory until garbage collected. If you would pass your strings by reference - it won't make much difference. You will be able to reuse memory location to store reference to created string (you can think of it as the memory address of string instance), but not memory location for string contents.

.NET small class with many instances optimization scenario?

I have millions of instances of class Data, I seek optimization advise.
Is there a way to optimize it in any way - save memory for example by serializing it somehow, although it will hurt the retrieval speed which is important too. Maybe turning the class to struct - but it seems that the class is pretty large for struct.
Queries for this objects can take hundreds-millions of these objects at a time. They sit in a list and queried by DateTime. The results are aggregated in different ways, many calculation can be applied.
[Serializable]
[DataContract]
public abstract class BaseData {}
[Serializable]
public class Data : BaseData {
public byte member1;
public int member2;
public long member3;
public double member4;
public DateTime member5;
}
Unfortunately, while you did specify that you want to "optimize", you did not specify what the exact problem is you mean to tackle. So I cannot really give you more than general advice.
Serialization will not help you. Your Data objects are already stored as bytes in memory. Nor will turning it into a struct help. The difference between a struct and a class lies in their addressing and referencing behaviour, not in their memory footprint.
The only way I can think of to reduce the memory footprint of a collection with "hundreds-millions" of these objects would be to serialize and compress the entire thing. But that is not feasible. You would always have to decompress the entire thing before accessing it, which would shoot your performance to hell AND actually almost double the memory consumption on access (compressed and decompressed data both lying in memory at that point).
The best general advice I can give you is not to try to optimize this scenario yourself, but use specialized software for that. By specialized software, I mean a (in-memory) database. One example of a database you can use in-memory, and for which you already have everything you need on-board in the .NET framework, is SQLite.
I assume, as you seem to imply, that you have a class with many members, have a large number of instances, and need to keep them all in memory at the same time to perform calculations.
I ran a few tests to see if you could actually get different sizes for the classes you described.
I used this simple method for finding the in-memory size of an object:
private static void MeasureMemory()
{
int size = 10000000;
object[] array = new object[size];
long before = GC.GetTotalMemory(true);
for (int i = 0; i < size; i++)
{
array[i] = new Data();
}
long after = GC.GetTotalMemory(true);
double diff = after - before;
Console.WriteLine("Total bytes: " + diff);
Console.WriteLine("Bytes per object: " + diff / size);
}
It may be primitive, but I find that it works fine for situations like this.
As expected, almost nothing you can do to that class (turning it to a struct, removing the inheritance, or the method attributes) influences the memory being used by a single instance. As far as memory usage goes, they are all equivalent. However, do try to fiddle with your actual classes and run them through the given code.
The only way you could actually reduce the memory footprint of an instance would be to use smaller structures for keeping your data (int instead of long for example). If you have a large number of booleans, you could group them into a byte or integer, and have simple property wrappers to work with them (A boolean takes a byte of memory). These may be insignificant things in most situations, but for a hundred million objects, removing a boolean could make a difference of a hundred MB of memory. Also, be aware that the platform target you choose for your application can have an impact on the memory footprint of an object (x64 builds take up more memory then x86 ones).
Serializing the data is very unlikely to help. An in-memory database has it's upsides, especially if you are doing complex queries. However, it is unlikely to actually reduce the memory usage for your data. Unfortunately, there just aren't many ways to reduce the footprint of basic data types. At some point, you will just have to move to a file-based database.
However, here are some ideas. Please be aware that they are hacky, highly conditional, decrease the computation performance and will make the code harder to maintain.
It is often a case in large data structures that objects in different states will have only some properties filled, and the other will be set to null or a default value. If you can identify such groups of properties, perhaps you could move them to a sub-class, and have one reference that could be null instead of having several properties take up space. Then you only instantiate the sub-class once it is needed. You could write property wrappers that could hide this from the rest of the code. Have in mind that the worst case scenario here would have you keeping all the properties in memory, plus several object headers and pointers.
You could perhaps turn members that are likely to take a default value into binary representations, and then pack them into a byte array. You would know which byte positions represent which data member, and could write properties that could read them. Position the properties that are most likely to have a default value at the end of the byte array (a few longs that are often 0 for example). Then, when creating the object, adjust the byte array size to exclude the properties that have the default value, starting from the end of the list, until you hit the first member that has a non-default value. When the outside code requests a property, you can check if the byte array is large enough to hold that property, and if not, return the default value. This way, you could potentially save some space. Best case, you will have a null pointer to a byte array instead of several data members. Worst case, you will have full byte arrays taking as much space as the original data, plus some overhead for the array. The usefulness depends on the actual data, and assumes that you do relatively few writes, as the re-computation of the array will be expensive.
Hope any of this helps :)

String caching. Memory optimization and re-use

I am currently working on a very large legacy application which handles a large amount of string data gathered from various sources (IE, names, identifiers, common codes relating to the business etc). This data alone can take up to 200 meg of ram in the application process.
A colleague of mine mentioned one possible strategy to reduce the memory footprint (as a lot of the individual strings are duplicate across the data sets), would be to "cache" the recurring strings in a dictionary and re-use them when required. So for example…
public class StringCacher()
{
public readonly Dictionary<string, string> _stringCache;
public StringCacher()
{
_stringCache = new Dictionary<string, string>();
}
public string AddOrReuse(string stringToCache)
{
if (_stringCache.ContainsKey(stringToCache)
_stringCache[stringToCache] = stringToCache;
return _stringCache[stringToCache];
}
}
Then to use this caching...
public IEnumerable<string> IncomingData()
{
var stringCache = new StringCacher();
var dataList = new List<string>();
// Add the data, a fair amount of the strings will be the same.
dataList.Add(stringCache.AddOrReuse("AAAA"));
dataList.Add(stringCache.AddOrReuse("BBBB"));
dataList.Add(stringCache.AddOrReuse("AAAA"));
dataList.Add(stringCache.AddOrReuse("CCCC"));
dataList.Add(stringCache.AddOrReuse("AAAA"));
return dataList;
}
As strings are immutable and a lot of internal work is done by the framework to make them work in a similar way to value types i'm half thinking that this will just create a copy of each the string into the dictionary and just double the amount of memory used rather than just pass a reference to the string stored in the dictionary (which is what my colleague is assuming).
So taking into account that this will be run on a massive set of string data...
Is this going to save any memory, assuming that 30% of the string values will be used twice or more?
Is the assumption that this will even work correct?
This is essentially what string interning is, except you don't have to worry how it works. In your example you are still creating a string, then comparing it, then leaving the copy to be disposed of. .NET will do this for you in runtime.
See also String.Intern and Optimizing C# String Performance (C Calvert)
If a new string is created with code like (String goober1 = "foo"; String goober2 = "foo";) shown in lines 18 and 19, then the intern table is checked. If your string is already in there, then both variables will point at the same block of memory maintained by the intern table.
So, you don't have to roll your own - it won't really provide any advantage. EDIT UNLESS: your strings don't usually live for as long as your AppDomain - interned strings live for the lifetime of the AppDomain, which is not necessarily great for GC. If you want short lived strings, then you want a pool. From String.Intern:
If you are trying to reduce the total amount of memory your application allocates, keep in mind that interning a string has two unwanted side effects. First, the memory allocated for interned String objects is not likely be released until the common language runtime (CLR) terminates. The reason is that the CLR's reference to the interned String object can persist after your application, or even your application domain, terminates. ...
EDIT 2 Also see Jon Skeets SO answer here
This is already built-in .NET, it's called String.Intern, no need to reinvent.
You can acheive this using the built in .Net functionality.
When you initialise your string, make a call to string.Intern() with your string.
For example:
dataList.Add(string.Intern("AAAA"));
Every subsequent call with the same string will use the same reference in memory. So if you have 1000 AAAAs, only 1 copy of AAAA is stored in memory.

Efficient temporary dataset

I have a class in C# that is storing information on a stack to be used by other pieces of the application later. The information is currently stored in a class (without any methods) and consists of some ints, shorts, floats and a couple of boolean values.
I can be processing 10-40 of these packets every second - potentially more - and the information on the stack will be removed when it's needed by another part of the program; however this isn't guaranteed to occur at any definite interval. The information also has a chance of being incomplete (I'm processing packets from a network connection).
Currently I have this represented as such:
public class PackInfo
{
public boolean active;
public float f1;
public float f2;
public int i1;
public int i2;
public int i3;
public int i4;
public short s1;
public short s2;
}
Is there a better way that this information can be represented? There's no chance of the stack getting too large (most of the information will be cleared if it starts getting too big) but I'm worried that there will be a needless amount of memory overhead involved in creating so many instances of the class to act as little more than a container for this information. Even though this is neither a computationally complex or memory-consuming task, I don't see it scaling well should it become either.
This sounds like it would be a good idea to use a generic Queue for storing these. I have the assumption that you're handling these "messages" in order.
As for the overhead of instantiating these classes, I don't think that instantiating 10-40 per second would have any visible impact on performance, the actual processing you do afterwards would likely be a much better candidate for optimization than the cost of instantiation.
Also, I would recommend only optimizing when you can actually measure performance in your application, otherwise you might be wasting your time doing premature optimization.

Is it a good idea to create a custom type for the primary key of each data table?

We have a lot of code that passes about “Ids” of data rows; these are mostly ints or guids. I could make this code safer by creating a different struct for the id of each database table. Then the type checker will help to find cases when the wrong ID is passed.
E.g the Person table has a column calls PersonId and we have code like:
DeletePerson(int personId)
DeleteCar(int carId)
Would it be better to have:
struct PersonId
{
private int id;
// GetHashCode etc....
}
DeletePerson(PersionId persionId)
DeleteCar(CarId carId)
Has anyone got real life experience
of dong this?
Is it worth the overhead?
Or more pain then it is worth?
(It would also make it easier to change the data type in the database of the primary key, that is way I thought of this ideal in the first place)
Please don’t say use an ORM some other big change to the system design as I know an ORM would be a better option, but that is not under my power at present. However I can make minor changes like the above to the module I am working on at present.
Update:
Note this is not a web application and the Ids are kept in memory and passed about with WCF, so there is no conversion to/from strings at the edge. There is no reason that the WCF interface can’t use the PersonId type etc. The PersonsId type etc could even be used in the WPF/Winforms UI code.
The only inherently "untyped" bit of the system is the database.
This seems to be down to the cost/benefit of spending time writing code that the compiler can check better, or spending the time writing more unit tests. I am coming down more on the side of spending the time on testing, as I would like to see at least some unit tests in the code base.
It's hard to see how it could be worth it: I recommend doing it only as a last resort and only if people are actually mixing identifiers during development or reporting difficulty keeping them straight.
In web applications in particular it won't even offer the safety you're hoping for: typically you'll be converting strings into integers anyway. There are just too many cases where you'll find yourself writing silly code like this:
int personId;
if (Int32.TryParse(Request["personId"], out personId)) {
this.person = this.PersonRepository.Get(new PersonId(personId));
}
Dealing with complex state in memory certainly improves the case for strongly-typed IDs, but I think Arthur's idea is even better: to avoid confusion, demand an entity instance instead of an identifier. In some situations, performance and memory considerations could make that impractical, but even those should be rare enough that code review would be just as effective without the negative side-effects (quite the reverse!).
I've worked on a system that did this, and it didn't really provide any value. We didn't have ambiguities like the ones you're describing, and in terms of future-proofing, it made it slightly harder to implement new features without any payoff. (No ID's data type changed in two years, at any rate - it's could certainly happen at some point, but as far as I know, the return on investment for that is currently negative.)
I wouldn't make a special id for this. This is mostly a testing issue. You can test the code and make sure it does what it is supposed to.
You can create a standard way of doing things in your system than help future maintenance (similar to what you mention) by passing in the whole object to be manipulated. Of course, if you named your parameter (int personID) and had documentation then any non malicious programmer should be able to use the code effectively when calling that method. Passing a whole object will do that type matching that you are looking for and that should be enough of a standardized way.
I just see having a special structure made to guard against this as adding more work for little benefit. Even if you did this, someone could come along and find a convenient way to make a 'helper' method and bypass whatever structure you put in place anyway so it really isn't a guarantee.
You can just opt for GUIDs, like you suggested yourself. Then, you won't have to worry about passing a person ID of "42" to DeleteCar() and accidentally delete the car with ID of 42. GUIDs are unique; if you pass a person GUID to DeleteCar in your code because of a programming typo, that GUID will not be a PK of any car in the database.
You could create a simple Id class which can help differentiate in code between the two:
public class Id<T>
{
private int RawValue
{
get;
set;
}
public Id(int value)
{
this.RawValue = value;
}
public static explicit operator int (Id<T> id) { return id.RawValue; }
// this cast is optional and can be excluded for further strictness
public static implicit operator Id<T> (int value) { return new Id(value); }
}
Used like so:
class SomeClass
{
public Id<Person> PersonId { get; set; }
public Id<Car> CarId { get; set; }
}
Assuming your values would only be retrieved from the database, unless you explicitly cast the value to an integer, it is not possible to use the two in each other's place.
I don't see much value in custom checking in this case. You might want to beef up your testing suite to check that two things are happening:
Your data access code always works as you expect (i.e., you aren't loading inconsistent Key information into your classes and getting misuse because of that).
That your "round trip" code is working as expected (i.e., that loading a record, making a change and saving it back isn't somehow corrupting your business logic objects).
Having a data access (and business logic) layer you can trust is crucial to being able to address the bigger pictures problems you will encounter attempting to implement the actual business requirements. If your data layer is unreliable you will be spending a lot of effort tracking (or worse, working around) problems at that level that surface when you put load on the subsystem.
If instead your data access code is robust in the face of incorrect usage (what your test suite should be proving to you) then you can relax a bit on the higher levels and trust they will throw exceptions (or however you are dealing with it) when abused.
The reason you hear people suggesting an ORM is that many of these issues are dealt with in a reliable way by such tools. If your implementation is far enough along that such a switch would be painful, just keep in mind that your low level data access layer needs to be as robust as an good ORM if you really want to be able to trust (and thus forget about to a certain extent) your data access.
Instead of custom validation, your testing suite could inject code (via dependency injection) that does robust tests of your Keys (hitting the database to verify each change) as the tests run and that injects production code that omits or restricts such tests for performance reasons. Your data layer will throw errors on failed keys (if you have your foreign keys set up correctly there) so you should also be able to handle those exceptions.
My gut says this just isn't worth the hassle. My first question to you would be whether you actually have found bugs where the wrong int was being passed (a Car ID instead of a Person ID in your example). If so, it is probably more of a case of worse overall architecture in that your Domain objects have too much coupling, and are passing too many arguments around in method parameters rather than acting on internal variables.

Categories