I have 8 uints which represent a security key like this:
uint firstParam = ...;
uint secondParam = ...;
uint thirdParam = ...;
uint etcParam = ...;
uint etcParam = ...;
They are allocated as local variables, inside of an UNSAFE method.
Those keys are very sensitive.
Do those locals on the stack get deleted when the method returns? Does marking the method unsafe have any effect on this? MSDN says that unsafe code is automatically pinned in memory.
If they are not removed from memory, will assigning them all to 0 help at the end of the method, even though analyzers will say this has no effect?
So I tested zeroing out the variables. However, in x64 Release mode the zeroing is removed from the final product (checked using ILSpy).
Is there any way to stop this?
Here is the sample code (in x64 Release)
private static void Main(string[] args)
{
int num = new Random().Next(10, 100);
Console.WriteLine(num);
MethodThatDoesSomething(num);
num = 0; // This line is removed!
Console.ReadLine();
}
private static void MethodThatDoesSomething(int num)
{
Console.WriteLine(num);
}
The num = 0 statement is removed in x64 release.
I cannot use SecureString because I'm P/Invoking into a native method which takes the uints as parameters.
I'm P/Invoking into the unmanaged method AllocateAndInitializeSid, which takes 8 uints as parameters. What could I do in this scenario?
I have tried adding
[MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.NoOptimization)]
to the sample code (above Main method), however, the num = 0 is STILL removed!
EDIT: after some reasoning I've come to correct this answer.
DO NOT use SecureString. As @Servy and @Alejandro point out in the comments, it is no longer considered really secure and will give a misguided sense of security, probably leading to further unconsidered exposures.
I have struck the passages I'm not comfortable with anymore and, in their place, would recommend the following.
To assign firstParam use:
firstParam = value ^ OBFUSCATION_MASK;
To read firstParam use (again):
firstParam ^ OBFUSCATION_MASK;
The ^ (bitwise XOR) operator is the inverse of itself, so applying it twice returns the original value. By reducing the time the value exists without obfuscation (for the CPU time is actually the number of machine code cycles), its exposure is also reduced. When the value is stored for long-term (say, 2-3 microseconds) it should always be obfuscated. For example:
private static uint firstParam; // use static so that the compiler cannot remove apparently "useless" assignments
public void f()
{
// somehow acquire the value (network? encrypted file? user input?)
firstParam = externalSourceFunctionNotInMyCode() ^ OBFUSCATION_MASK; // obfuscate immediately
}
Then, several microseconds later:
public void g()
{
// use the value
externalUsageFunctionNotInMyCode(firstParam ^ OBFUSCATION_MASK);
}
The two external[Source|Usage]FunctionNotInMyCode() are entry and exit points of the value. The important thing is that as long as the value is stored in my code it is never in the plain, it's always obfuscated. What happens before and after my code is not under our control and we must live with it. At some point values must enter and/or exit. Otherwise what program would that be?
One last note is about the OBFUSCATION_MASK. I would randomize it for every start of the application, but ensure that the entropy is high enough, that means that the count of 0 and 1 is maybe not fifty/fifty, but near it. I think RNGCryptoServiceProvider will suffice. If not, it's always possible to count the bits or compute the entropy:
private static readonly uint OBFUSCATION_MASK = cryptographicallyStrongRandomizer();
At that point it's relatively difficult to identify the sensitive values in the binary soup and maybe even irrelevant if the data was paged out to disk.
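As a rough sketch of the mask generation described above, the following draws a random 32-bit value and checks that its bit count is reasonably balanced. The class name, the method names and the 12 to 20 window are assumptions made for illustration; the standard RandomNumberGenerator class is used in place of RNGCryptoServiceProvider, and rejecting unbalanced candidates slightly narrows the set of possible masks.

```csharp
using System;
using System.Security.Cryptography;

public static class ObfuscationMask
{
    // Counts the bits set to 1 in a 32-bit value.
    public static int CountSetBits(uint value)
    {
        int count = 0;
        while (value != 0)
        {
            value &= value - 1; // clears the lowest set bit
            count++;
        }
        return count;
    }

    // Draws random 32-bit values until one has a roughly balanced bit count.
    // The 12..20 window is an arbitrary threshold chosen for illustration.
    public static uint Create()
    {
        byte[] buffer = new byte[4];
        using (RandomNumberGenerator rng = RandomNumberGenerator.Create())
        {
            while (true)
            {
                rng.GetBytes(buffer);
                uint candidate = BitConverter.ToUInt32(buffer, 0);
                int ones = CountSetBits(candidate);
                if (ones >= 12 && ones <= 20)
                {
                    Array.Clear(buffer, 0, buffer.Length); // do not leave the mask in the scratch buffer
                    return candidate;
                }
            }
        }
    }
}
```

The static field from the answer could then be initialized with ObfuscationMask.Create() instead of the placeholder cryptographicallyStrongRandomizer().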
As always, security must be balanced with cost and efficiency (in this case, also readability and maintainability).
ORIGINAL ANSWER:
Even with pinned unmanaged memory you cannot be sure if the physical memory is paged out to the disk by the OS.
In fact, in nations where Internet Bars are very common, clients may use your program on a publicly accessible machine. An attacker may try and do as follows:
compromise a machine by running a process that occasionally allocates all the RAM available;
wait for other clients to use that machine and run a program with sensitive data (such as username and password);
once the rogue program exhausts all RAM, the OS will page out the virtual memory pages to disk;
after several hours of usage by other clients the attacker comes back to the machine to copy unused sectors and slack space to an external device;
his hope is that pagefile.sys changed sectors several times (this occurs through sector rotation and such, which may not be avoided by the OS and can depend on hardware/firmware/drivers);
he brings the external device to his dungeon and slowly but patiently analyzes the gathered data, which is mainly binary gibberish, but may contain runs of ASCII characters.
By analyzing the data with all the time in the world and no pressure at all, he may find those sectors to which pagefile.sys has been written several "writes" before. There, the content of the RAM and thus heap/stack of programs can be inspected.
If a program stored sensitive data in a string, this procedure would expose it.
Now, you're using uint not string, but the same principles still apply. To be sure to not expose any sensitive data, even if paged out to disk, you can use secure versions of types, such as SecureString.
The usage of uint somewhat protects you from ASCII scanning, but to be really sure you should never store sensitive data in unsafe variables, which means you should somehow convert the uint into a string representation and store it exclusively in a SecureString.
Hope that helps someone implementing secure apps.
In .NET, you can never be sure that variables are actually cleared from memory.
Since the CLR manages the memory, it's free to move them around, liberally leaving old copies behind, even if you purposely overwrite them with zeroes or other random values. A memory analyzer or a debugger may still be able to get them if it has enough privileges.
So what can you do about it?
Just terminating the method leaves the data behind on the stack, and it'll eventually be overwritten by something else, without any certainty of when (or if) that will happen.
Manually overwriting it will help, provided the compiler doesn't optimize out the "useless" assignment (see this thread for details). This is more likely to succeed if the variables are short-lived (before the GC has had the chance to move them around), but you still have NO guarantees that there won't be other copies in other places.
The next best thing you can do is to terminate the whole process immediately, preferably after overwriting them too. This way the memory returns to the OS, which will clear it before giving it away to another process. You're still at the mercy of kernel-mode analyzers, though, but now you've raised the bar significantly.
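One way to make the overwrite harder to optimize away is to keep the values out of locals and in unmanaged memory, which the GC never moves, and to zero that memory through Marshal calls. The sketch below is only an illustration: UnmanagedKey and its members are hypothetical names, and it does nothing about copies that exist before the values reach the buffer or after they are read back out of it.

```csharp
using System;
using System.Runtime.InteropServices;

// Holds eight uints in unmanaged memory and zeroes them on Dispose.
public sealed class UnmanagedKey : IDisposable
{
    private const int Count = 8;
    private IntPtr buffer;

    public UnmanagedKey()
    {
        buffer = Marshal.AllocHGlobal(Count * sizeof(uint));
        Zero(); // AllocHGlobal does not initialize the memory
    }

    public void Set(int index, uint value)
    {
        Check(index);
        Marshal.WriteInt32(buffer, index * sizeof(uint), unchecked((int)value));
    }

    public uint Get(int index)
    {
        Check(index);
        return unchecked((uint)Marshal.ReadInt32(buffer, index * sizeof(uint)));
    }

    public void Dispose()
    {
        if (buffer == IntPtr.Zero) return;
        Zero();                      // overwrite before handing the memory back
        Marshal.FreeHGlobal(buffer);
        buffer = IntPtr.Zero;
    }

    private void Zero()
    {
        for (int i = 0; i < Count; i++)
        {
            Marshal.WriteInt32(buffer, i * sizeof(uint), 0);
        }
    }

    private void Check(int index)
    {
        if (buffer == IntPtr.Zero) throw new ObjectDisposedException(nameof(UnmanagedKey));
        if (index < 0 || index >= Count) throw new ArgumentOutOfRangeException(nameof(index));
    }
}
```

Wrapping the instance in a using statement keeps the lifetime of the plain values as short as the surrounding code allows.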
I want to process many integers in a class, so I listed them into an int* array.
int*[] pp = new int*[]{&aaa,&bbb,&ccc};
However, the compiler declined the code above with the following EXCUSE:
> You can only take the address of an unfixed expression inside of a fixed statement initializer
I know I can change the code above to avoid this error; however, we need to consider ddd and eee will join the array in the future.
public enum E {
aaa,
bbb,
ccc,
_count
}
for(int i=0;i<(int)E._count;i++)
gg[(int)E.bbb]
Dictionary<string,int>ppp=new Dictionary<string,int>();
ppp["aaa"]=ppp.Count;
ppp["bbb"]=ppp.Count;
ppp["ccc"]=ppp.Count;
gg[ppp["bbb"]]
These solutions work, but they make both the code and the execution time longer.
I also expect a nonofficial patch to the compiler or a new nonofficial C# compiler, but I have not seen an available download for many years; it seems very difficult to have one for us.
Are there better ways so that
I do not need to count the members of the array ppp myself.
If the code does get longer, it is only by a few characters.
The execution time does not increase much.
Adding ddd and eee to the array takes only one or two statements per new member.
The .NET runtime is a managed execution runtime which (among other things) provides garbage collection. The .NET garbage collector (GC) not only manages the allocation and release of memory, but also transparently moves objects around the "managed heap", blocking the rest of your code while doing it.
It also compacts (defragments) the memory by moving longer lived objects together, and even "promoting" them into different parts of the heap, called generations, to avoid checking their status too often.
There is a bunch of memory being copied all the time without your program even realizing it. Since garbage collection can happen at any time during the execution of your program, any pointer-related ("unsafe") operations must be done within a small scope, by telling the runtime to "pin" the objects using the fixed keyword. This prevents the GC from moving them, but only for a while.
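To illustrate the pinning described above, here is a minimal sketch of a fixed block. PinningExample and Sum are made-up names, and the code needs unsafe code to be enabled for the project (the AllowUnsafeBlocks setting).

```csharp
using System;

public static class PinningExample
{
    // Sums an array through a raw pointer. The array is pinned only while
    // execution is inside the fixed block; afterwards the GC may move it again.
    public static unsafe int Sum(int[] values)
    {
        int total = 0;
        fixed (int* p = values)
        {
            for (int i = 0; i < values.Length; i++)
            {
                total += p[i];
            }
        }
        return total;
    }
}
```

The pointer p must not be stored anywhere that outlives the fixed block, which is why an int*[] field holding the addresses of movable variables is rejected by the compiler.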
Using pointers and unsafe code in C# is not only less safe, but also not very idiomatic for managed languages in general. If coming from a C background, you may feel at home with these constructs, but C# has a completely different philosophy: your job as a C# programmer should be to write reliable, readable and maintainable code, and only then think about squeezing a couple of CPU cycles for performance reasons. You can use pointers from time to time in small functions, doing some very specific, time-critical code. But even then it is your duty to profile before making such optimizations. Even the most experienced programmers often fail at predicting bottlenecks before profiling.
Finally, regarding your actual code:
I don't see why you think this:
int*[] pp = new int*[] {&aaa, &bbb, &ccc};
would be any more performant than this:
int[] pp = new int[] {aaa, bbb, ccc};
On a 32-bit machine, an int and a pointer are of the same size. On a 64-bit machine, a pointer is even bigger.
Consider replacing these plain ints with a class of your own which will provide some context and additional functionality/data to each of these values. Create a new question describing the actual problem you are trying to solve (you can also use Code Review for such questions) and you will benefit from much better suggestions.
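As one possible shape for such a class, the sketch below keeps the values in a plain int array indexed by an enum, so adding ddd or eee is one extra line in the enum. Slot and NamedValues are hypothetical names, and the code assumes the enum members keep their default contiguous values starting at 0.

```csharp
using System;

// One line per named value; new members are simply appended here.
public enum Slot { Aaa, Bbb, Ccc }

public sealed class NamedValues
{
    // Sized from the enum, so nothing has to be counted by hand.
    private readonly int[] values = new int[Enum.GetValues(typeof(Slot)).Length];

    public int Count => values.Length;

    public int this[Slot slot]
    {
        get => values[(int)slot];
        set => values[(int)slot] = value;
    }

    // Example of processing every value in a loop.
    public int Sum()
    {
        int total = 0;
        for (int i = 0; i < values.Length; i++)
        {
            total += values[i];
        }
        return total;
    }
}
```

Access by name stays a simple array index (for example v[Slot.Bbb]), so the lookup cost is comparable to the enum approach from the question without a separate _count member.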
I was having a discussion with a colleague the other day about this hypothetical situation. Consider this pseudocode:
public void Main()
{
MyDto dto = Repository.GetDto();
foreach(var row in dto.Rows)
{
ProcessStrings(row);
}
}
public void ProcessStrings(DataRow row)
{
string string1 = GetStringFromDataRow(row, 1);
string string2 = GetStringFromDataRow(row, 2);
// do something with the strings
}
Then this functionally identical alternative:
public void Main()
{
string string1 = null;
string string2 = null;
MyDto dto = Repository.GetDto();
foreach(var row in dto.Rows)
{
ProcessStrings(row, string1, string2);
}
}
public void ProcessStrings(DataRow row, string string1, string string2)
{
string1 = GetStringFromDataRow(row, 1);
string2 = GetStringFromDataRow(row, 2);
// do something with the strings
}
How will these differ in processing when running the compiled code? Are we right in thinking the second version is marginally more efficient because the string variables will take up less memory and only be disposed once, whereas in the first version, they're disposed of on each pass of the loop?
Would it make any difference if the strings in the second version were passed by ref or as out parameters?
When you're dealing with "marginally more efficient" levels of optimization, you risk not seeing the whole picture and ending up "marginally less efficient".
This answer here risks the same thing, but with that caveat, let's look at the hypothesis:
Storing a string into a variable creates a new instance of the string
No, not at all. A string is an object, what you're storing in the variable is a reference to that object. On 32-bit systems this reference is 4 bytes in size, on 64-bit it is 8. Nothing more, nothing less. Moving 4/8 bytes around is overhead that you're not really going to notice a lot.
So neither of the two examples, with the very little information we have about the makings of the methods being called, creates more or less strings than the other so on this count they're equivalent.
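The point about references can be checked directly. In this small sketch (the class and method names are made up for illustration), assigning a string to a second variable copies only the reference, and IntPtr.Size reports the pointer size of the running process.

```csharp
using System;

public static class StringReferenceDemo
{
    // Returns true when both variables point at the very same string object.
    public static bool AssignmentSharesInstance()
    {
        string original = new string('x', 5); // one string object on the heap
        string copy = original;               // copies the reference, not the characters
        return object.ReferenceEquals(original, copy);
    }

    // Pointer size of the current process: 4 in a 32-bit process, 8 in a 64-bit one.
    public static int ReferenceSizeInBytes()
    {
        return IntPtr.Size;
    }
}
```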
So what is different?
Well in one example you're storing the two string references into local variables. This is most likely going to be cpu registers. Could be memory on the stack. Hard to say, depends on the rest of the code. Does it matter? Highly unlikely.
In the other example you're passing in two parameters as null and then reusing those parameters locally. These parameters can be passed as cpu registers or stack memory. Same as the other. Did it matter? Not at all.
So most likely there is going to be absolutely no difference at all.
Note one thing, you're mentioning "disposal". This term is reserved for the usage of objects implementing IDisposable and then the act of disposing of these by calling IDisposable.Dispose on those objects. Strings are not such objects, this is not relevant to this question.
If, instead, by disposal you mean "garbage collection", then since I already established that neither of the two examples creates more or less objects than the others due to the differences you asked about, this is also irrelevant.
This is not important, however. It isn't important what you or I or your colleague thinks is going to have an effect. Knowing is quite different, which leads me to...
The real tip I can give about optimization:
Measure
Measure
Measure
Understand
Verify that you understand it correctly
Change, if possible
You measure, use a profiler to find the real bottlenecks and real time spenders in your code, then understand why those are bottlenecks, then ensure your understanding is correct, then you can see if you can change it.
In your code I will venture a guess that if you were to profile your program you would find that those two examples will have absolutely no effect whatsoever on the running time. If they do have effect it is going to be on order of nanoseconds. Most likely, the very act of looking at the profiler results will give you one or more "huh, that's odd" realizations about your program, and you'll find bottlenecks that are far bigger fish than the variables in play here.
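For a first rough measurement before reaching for a full profiler, something like the following Stopwatch helper can be used. Measure.Time is a hypothetical helper written for this illustration, and timings taken this way are only indicative: they vary between runs and say nothing about where inside the action the time goes.

```csharp
using System;
using System.Diagnostics;

public static class Measure
{
    // Runs the action once as a warm-up (so JIT compilation is not counted),
    // then times the requested number of iterations.
    public static TimeSpan Time(Action action, int iterations)
    {
        action();
        Stopwatch stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}
```

Timing both versions of ProcessStrings this way over the same rows would show whether there is any difference worth caring about.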
In both of your alternatives, GetStringFromDataRow creates a new string every time. Whether you store a reference to this string in a local variable or in a parameter variable (which in your case is essentially no different from a local variable) does not matter. Imagine you did not assign the result of GetStringFromDataRow to any variable at all: the string instance would still be created and stored somewhere in memory until garbage collected. Passing your strings by reference won't make much difference either. You would be able to reuse the memory location that stores the reference to the created string (you can think of it as the memory address of the string instance), but not the memory location of the string contents.
I have an Int64 containing two Int32 like this:
[StructLayout(LayoutKind.Explicit)]
public struct PackedInt64
{
[FieldOffset(0)]
public Int64 All;
[FieldOffset(0)]
public Int32 First;
[FieldOffset(4)]
public Int32 Second;
}
Now I want constructors (for all, first and second). However the struct requires all fields to be assigned before the constructor is exited.
Consider the all constructor.
public PackedInt64(Int64 all)
{
this.First = 0;
this.Second = 0;
Thread.MemoryBarrier();
this.All = all;
}
I want to be absolutely sure that this.All is assigned last in the constructor so that half of the field or more isn't overwritten in case of some compiler optimization or instruction reordering in the cpu.
Is Thread.MemoryBarrier() sufficient? Is it the best option?
Yes, this is the correct and best way of preventing reordering.
By executing Thread.MemoryBarrier() in your sample code, the processor will never be allowed to reorder instructions in such a way that the access/modification to First or Second will occur after the access/modification to All. Since they both occupy the same address space, you don't have to worry about your later changes being overwritten by your earlier ones.
Note that Thread.MemoryBarrier() only works for the current executing thread -- it isn't a type of lock. However, given that this code is running in a constructor and no other thread can yet have access to this data, this should be perfectly fine. If you do need cross-thread guarantee of operations, however, you'll need to use a locking mechanism to guarantee exclusive access.
Note that you may not actually need this instruction on x86 based machines, but I would still recommend the code in case you run on another platform one day (such as IA64). See the below chart for what platforms will reorder memory post-save, rather than just post-load.
The MemoryBarrier will prevent re-ordering, but this code is still broken.
LayoutKind.Explicit and FieldOffsetAttribute are documented as affecting the memory layout of the object when it is passed to unmanaged code. It can be used to interop with a C union, but it cannot be used to emulate a C union.
Even if it currently acts the way you expect, on the platform you tested, there is no guarantee that it will continue to do so. The only guarantee made is in the context of interop with unmanaged code (that is, p/invoke, COM interop, or C++/CLI it-just-works).
If you want to read a subset of bytes in a portable future-proof manner, you'll have to use bitwise operations or a byte array and BitConverter. Even if the syntax isn't as nice.
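A sketch of the bitwise alternative is shown below. Packed, Pack, First and Second are illustrative names; First is placed in the low 32 bits, which matches what the overlapping struct would show on a little-endian machine but here does not depend on memory layout at all.

```csharp
public static class Packed
{
    // Combines two 32-bit values into one 64-bit value:
    // first goes into the low 32 bits, second into the high 32 bits.
    public static long Pack(int first, int second)
    {
        return unchecked((long)(((ulong)(uint)second << 32) | (uint)first));
    }

    // Extracts the low 32 bits.
    public static int First(long all)
    {
        return unchecked((int)(all & 0xFFFFFFFFL));
    }

    // Extracts the high 32 bits.
    public static int Second(long all)
    {
        return unchecked((int)(all >> 32));
    }
}
```

Because the value lives in a single Int64, there is no ordering question between the parts and no need for a memory barrier in a constructor.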
Check the remarks section of the following link: http://msdn.microsoft.com/en-us/library/system.threading.thread.memorybarrier.aspx
It says MemoryBarrier() is required only on multiprocessor systems with weak memory ordering. So, this is a sufficient option but whether this is the best option or not depends upon the system you are using.
First, I'm aware this answer doesn't really solve the reordering problem, but negates it. By using unsafe code, you can avoid writing to First and Second completely.
public unsafe PackedInt64(long all) {
fixed (PackedInt64* ptr = &this)
*(long*) ptr = all;
}
It's not meant to be the most elegant solution and probably doesn't pass most company policies regarding managed code, but it should work.
Currently, I am working on a project where I need to bring GBs of data onto the client machine to do some task. The task needs the whole data set, as it runs some analysis on the data and helps in the decision making process.
So the question is: what are the best practices and a suitable approach for managing that amount of data in memory without hampering the performance of the client machine and the application?
Note: at application load time, we can spend time bringing the data from the database to the client machine; that's totally acceptable in our case. But once the data is loaded into the application at start up, performance is very important.
This is a little hard to answer without a problem statement, i.e. what problems you are currently facing, but the following is just some thoughts, based on some recent experiences we had in a similar scenario. It is, however, a lot of work to change to this type of model - so it also depends how much you can invest trying to "fix" it, and I can make no promise that "your problems" are the same as "our problems", if you see what I mean. So don't get cross if the following approach doesn't work for you!
Loading that much data into memory is always going to have some impact, however, I think I see what you are doing...
When loading that much data naively, you are going to have many (millions?) of objects and a similar-or-greater number of references. You're obviously going to want to be using x64, so the references will add up - but in terms of performance the biggest problem is going to be garbage collection. You have a lot of objects that can never be collected, but the GC is going to know that you're using a ton of memory, and is going to try anyway periodically. This is something I looked at in more detail here, but the following graph shows the impact - in particular, those "spikes" are all GC killing performance:
For this scenario (a huge amount of data loaded, never released), we switched to using structs, i.e. loading the data into:
struct Foo {
private readonly int id;
private readonly double value;
public Foo(int id, double value) {
this.id = id;
this.value = value;
}
public int Id {get{return id;}}
public double Value {get{return value;}}
}
and stored those directly in arrays (not lists):
Foo[] foos = ...
the significance of that is that because some of these structs are quite big, we didn't want them copying themselves lots of times on the stack, but with an array you can do:
private void SomeMethod(ref Foo foo) {
if(foo.Value == ...) {blah blah blah}
}
// call ^^^
int index = 17;
SomeMethod(ref foos[index]);
Note that we've passed the object directly - it was never copied; foo.Value is actually looking directly inside the array. The tricky bit starts when you need relationships between objects. You can't store a reference here, as it is a struct, and you can't store that. What you can do, though, is store the index (into the array). For example:
struct Customer {
// ... more not shown
public int FooIndex { get { return fooIndex; } }
}
Not quite as convenient as customer.Foo, but the following works nicely:
Foo foo = foos[customer.FooIndex];
// or, when passing to a method, SomeMethod(ref foos[customer.FooIndex]);
Key points:
we're now using half the size for "references" (an int is 4 bytes; a reference on x64 is 8 bytes)
we don't have several-million object headers in memory
we don't have a huge object graph for GC to look at; only a small number of arrays that GC can look at incredibly quickly
but it is a little less convenient to work with, and needs some initial processing when loading
additional notes:
strings are a killer; if you have millions of strings, then that is problematic; at a minimum, if you have strings that are repeated, make sure you do some custom interning (not string.Intern, that would be bad) to ensure you only have one instance of each repeated value, rather than 800,000 strings with the same contents
if you have repeated data of finite length, rather than sub-lists/arrays, you might consider a fixed array; this requires unsafe code, but avoids another myriad of objects and references
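The custom interning mentioned above could look roughly like this. StringPool is a hypothetical helper, not something from the original answer; unlike string.Intern, the pool is an ordinary object that can be dropped once loading has finished.

```csharp
using System.Collections.Generic;

// Returns one shared instance per distinct string value.
public sealed class StringPool
{
    private readonly Dictionary<string, string> pool = new Dictionary<string, string>();

    public int Count => pool.Count;

    public string Intern(string value)
    {
        if (value == null) return null;
        if (pool.TryGetValue(value, out string existing))
        {
            return existing;      // reuse the instance seen first
        }
        pool.Add(value, value);   // first occurrence becomes the shared instance
        return value;
    }
}
```

Passing every string through Intern while loading means repeated values end up as references to one instance, and the duplicates become garbage that is collected early.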
As an additional footnote, with that volume of data, you should think very seriously about your serialization protocols, i.e. how you're sending the data down the wire. I would strongly suggest staying far away from things like XmlSerializer, DataContractSerializer or BinaryFormatter. If you want pointers on this subject, let me know.
I have heard conflicting stories on this topic and am looking for a little bit of clarity.
How would one dispose of a string object immediately, or at the very least clear traces of it?
That depends. Literal strings are interned by default, so even if your application no longer references one, it will not be collected, as it is still referenced by the internal interning structure. Other strings are just like any other managed object. As soon as they are no longer referenced by your application they are eligible for garbage collection.
More about interning here in this question: Where do Java and .NET string literals reside?
If you need to protect a string and be able to dispose it when you want, use System.Security.SecureString class.
Protect sensitive data with .NET 2.0's SecureString class
I wrote a little extension method for the string class for situations like this, it's probably the only sure way of ensuring the string itself is unreadable until collected. Obviously only works on dynamically generated strings, not literals.
public unsafe static void Clear(this string s)
{
fixed(char* ptr = s)
{
for(int i = 0; i < s.Length; i++)
{
ptr[i] = '\0';
}
}
}
This is all down to the garbage collector to handle that for you. You can force it to run a clean-up by calling GC.Collect(). From the docs:
Use this method to try to reclaim all memory that is inaccessible.
All objects, regardless of how long they have been in memory, are considered for collection; however, objects that are referenced in managed code are not collected. Use this method to force the system to try to reclaim the maximum amount of available memory.
That's the closest you'll get me thinks!!
I will answer this question from a security perspective.
If you want to destroy a string for security reasons, then it is probably because you don't want anyone snooping on your secret information, and you expect they might scan the memory, or find it in a page file or something if the computer is stolen or otherwise compromised.
The problem is that once a System.String is created in a managed application, there is not really a lot you can do about it. There may be some sneaky way of doing some unsafe reflection and overwriting the bytes, but I can't imagine that such things would be reliable.
The trick is to never put the info in a string at all.
I had this issue one time with a system that I developed for some company laptops. The hard drives were not encrypted, and I knew that if someone took a laptop, then they could easily scan it for sensitive info. I wanted to protect a password from such attacks.
The way I dealt with it is this: I put the password in a byte array by capturing key press events on the textbox control. The textbox never contained anything but asterisks and single characters. The password never existed as a string at any time. I then hashed the byte array and zeroed the original. The hash was then XORed with a random hard-coded key, and this was used to encrypt all the sensitive data.
After everything was encrypted, then the key was zeroed out.
Naturally, some of the data might exist in the page file as plaintext, and it's also possible that the final key could be inspected as well. But nobody was going to steal the password dang it!
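The hash-then-zero step can be sketched as follows. The answer does not say which hash was used, so SHA-256 is an assumption here, as are the names PasswordKey and DeriveAndScrub; a production system would more likely use a dedicated key derivation function such as Rfc2898DeriveBytes.

```csharp
using System;
using System.Security.Cryptography;

public static class PasswordKey
{
    // Hashes the password bytes, then zeroes the caller's buffer so the
    // plain password no longer sits in that array.
    public static byte[] DeriveAndScrub(byte[] passwordBytes)
    {
        try
        {
            using (SHA256 sha = SHA256.Create())
            {
                return sha.ComputeHash(passwordBytes);
            }
        }
        finally
        {
            Array.Clear(passwordBytes, 0, passwordBytes.Length);
        }
    }
}
```

The returned hash is itself sensitive and should be cleared the same way once the encryption is done, as the answer describes for the final key.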
There's no deterministic way to clear all traces of a string (System.String) from memory. Your only options are to use a character array or a SecureString object.
One of the best ways to limit the lifetime of string objects in memory is to declare them as local variables in the innermost scope possible and not as private member variables on a class.
It's a common mistake for junior developers to declare their strings 'private string ...' on the class itself.
I've also seen well-meaning experienced developers trying to cache some complex string concatenation (a+b+c+d...) in a private member variable so they don't have to keep calculating it. Big mistake - it takes hardly any time to recalculate it, the temporary strings are garbage collected almost immediately when the first generation of GC happens, and the memory swallowed by caching all those strings just took available memory away from more important items like cached database records or cached page output.
Set the string variable to null once you don't need it.
string s = "dispose me!";
...
...
s = null;
and then call GC.Collect() to invoke the garbage collector. Even then, the GC CANNOT guarantee that the string will be collected immediately.