How does String.Remove() operate regarding memory?

I was wondering how .NET's string.Remove() method operates regarding memory.
If I have the following piece of code:
string sample = "abc";
sample = sample.Remove(0);
What will actually happen in memory?
If I understand correctly, we've allocated a string consisting of 3 chars, then removed all of them in a new copy of the string and assigned that copy to the old variable, overwriting the original reference. And then what? What happens to those 3 characters?
If we're not pointing to them anymore, and they're not freed up (at least not that I'm aware of), they will remain in memory as garbage.
However, I'm sure the CLR has some way of detecting it and freeing them up eventually.
So any of you guys know what happens here? Thanks in advance!

First, Remove returns a new string; here every character is removed, so the result is an empty string. A .NET string stores its characters inside the string object itself, so there is no separate char array to allocate, and for an empty result the implementations I'm aware of hand back the shared string.Empty instance rather than allocating anything. You then assign a reference to that string to your local variable.
Since the string "abc" is a literal string, it'll still exist in the intern pool (unless you've disabled interning of compile-time literal strings), so it won't be garbage collected.
So in summary, you've changed the variable sample to refer to a different string object; the old object is left untouched.

According to the source code: http://referencesource.microsoft.com/#mscorlib/system/string.cs
The Remove() method returns a new string object containing the result; the original string is not modified.
In your code sample, the sample variable is reassigned to a string object that no longer has any of the characters from index 0 onward, which here means an empty string.
When the garbage collector runs, the orphaned string is reclaimed, provided nothing else still references it.
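
To make the answers above concrete, here is a minimal sketch showing that Remove leaves the original string untouched and that the variable simply ends up referring to a different object. RemoveDemo and Describe are hypothetical names introduced only for this illustration.

```csharp
using System;

public static class RemoveDemo
{
    // Hypothetical helper: applies Remove(0) and reports what happened to both strings.
    public static string Describe(string original)
    {
        string sample = original;      // second reference to the same string object
        sample = sample.Remove(0);     // sample now refers to a different string

        // original still refers to the untouched characters.
        return "original=\"" + original + "\", sample.Length=" + sample.Length
            + ", sameObject=" + object.ReferenceEquals(original, sample);
    }
}
```

Calling RemoveDemo.Describe("abc") should report that the original is still "abc", that sample has length 0, and that the two variables no longer refer to the same object.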

Related

Difference between assigning object to variable and passing it as argument VS directly creating object and passing it in argument

What is the difference in both and which is recommended and why?
var getDetailsReq = new getTransactionDetailsRequest
{
    transId = transactionResponse.Payload.Id
};
var getDetailsCont = new getTransactionDetailsController(getDetailsReq);
vs
var getDetailsCont = new getTransactionDetailsController(new getTransactionDetailsRequest
{
    transId = transactionResponse.Payload.Id
});
The first one holds the address of the object in memory, and it will be cleaned up when disposed of.
The second one will go untraceable and will be lost somewhere in memory.
Does that make sense, or do you have something to correct me on?

They are functionally equivalent if the reference is not used again by the caller. It's very possible the optimizer will remove the temporary variable anyway if it is not used by the calling scope.
The second one will go untraceable and will be lost somewhere in memory
Well, you pass it to getTransactionDetailsController, so presumably it does something with the reference. Once the garbage collector detects that no objects have a reference to the object, it will be garbage collected (not disposed).
So use whichever one you feel is better - there is no practical guidance that I know of.

Performance wise? Nothing.
Readability-wise? I'd argue that the first one is more readable and maintainable than the second one:
You can set a breakpoint on the appropriate constructor when debugging.
You can easily inspect / watch the values of getDetailsReq.

If you need getDetailsReq somewhere else in your code, use method 1. Otherwise, it shouldn't make a difference

C# - how does variable scope and disposal impact processing efficiency?

I was having a discussion with a colleague the other day about this hypothetical situation. Consider this pseudocode:
public void Main()
{
    MyDto dto = Repository.GetDto();
    foreach(var row in dto.Rows)
    {
        ProcessStrings(row);
    }
}
public void ProcessStrings(DataRow row)
{
    string string1 = GetStringFromDataRow(row, 1);
    string string2 = GetStringFromDataRow(row, 2);
    // do something with the strings
}
Then this functionally identical alternative:
public void Main()
{
    string string1 = null;
    string string2 = null;
    MyDto dto = Repository.GetDto();
    foreach(var row in dto.Rows)
    {
        ProcessStrings(row, string1, string2);
    }
}
public void ProcessStrings(DataRow row, string string1, string string2)
{
    string1 = GetStringFromDataRow(row, 1);
    string2 = GetStringFromDataRow(row, 2);
    // do something with the strings
}
How will these differ in processing when running the compiled code? Are we right in thinking the second version is marginally more efficient because the string variables will take up less memory and only be disposed once, whereas in the first version, they're disposed of on each pass of the loop?
Would it make any difference if the strings in the second version were passed by ref or as out parameters?

When you're dealing with "marginally more efficient" level of optimizations you risk not seeing the whole picture and end up being "marginally less efficient".
This answer here risks the same thing, but with that caveat, let's look at the hypothesis:
Storing a string into a variable creates a new instance of the string
No, not at all. A string is an object, what you're storing in the variable is a reference to that object. On 32-bit systems this reference is 4 bytes in size, on 64-bit it is 8. Nothing more, nothing less. Moving 4/8 bytes around is overhead that you're not really going to notice a lot.
So neither of the two examples, with the very little information we have about the makings of the methods being called, creates more or less strings than the other so on this count they're equivalent.
So what is different?
Well in one example you're storing the two string references into local variables. This is most likely going to be cpu registers. Could be memory on the stack. Hard to say, depends on the rest of the code. Does it matter? Highly unlikely.
In the other example you're passing in two parameters as null and then reusing those parameters locally. These parameters can be passed as cpu registers or stack memory. Same as the other. Did it matter? Not at all.
So most likely there is going to be absolutely no difference at all.
Note one thing, you're mentioning "disposal". This term is reserved for the usage of objects implementing IDisposable and then the act of disposing of these by calling IDisposable.Dispose on those objects. Strings are not such objects, this is not relevant to this question.
If, instead, by disposal you mean "garbage collection", then since I already established that neither of the two examples creates more or less objects than the others due to the differences you asked about, this is also irrelevant.
This is not important, however. It isn't important what you or I or your colleague thinks is going to have an effect. Knowing is quite different, which leads me to...
The real tip I can give about optimization:
Measure
Measure
Measure
Understand
Verify that you understand it correctly
Change, if possible
You measure, use a profiler to find the real bottlenecks and real time spenders in your code, then understand why those are bottlenecks, then ensure your understanding is correct, then you can see if you can change it.
In your code I will venture a guess that if you were to profile your program you would find that those two examples will have absolutely no effect whatsoever on the running time. If they do have effect it is going to be on order of nanoseconds. Most likely, the very act of looking at the profiler results will give you one or more "huh, that's odd" realizations about your program, and you'll find bottlenecks that are far bigger fish than the variables in play here.
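
A profiler is what the answer recommends. As a much cruder first check, a rough timing harness can at least show whether two variants differ measurably. This is only a sketch: MeasureDemo and Time are hypothetical names, and Stopwatch timings of tiny operations are noisy, so treat the numbers as indicative at best.

```csharp
using System;
using System.Diagnostics;

public static class MeasureDemo
{
    // Runs the action the given number of times and returns the elapsed time.
    public static TimeSpan Time(Action action, int iterations)
    {
        action();                         // warm-up call so JIT compilation is not measured
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        watch.Stop();
        return watch.Elapsed;
    }
}
```

You would pass each variant of the loop body as the action, use the same iteration count for both, and compare the results over several runs rather than trusting a single measurement.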

In both of your alternatives, GetStringFromDataRow creates a new string every time. Whether you store a reference to this string in a local variable or in a parameter variable (which is essentially not much different from a local variable in your case) does not matter. Imagine you did not even assign the result of GetStringFromDataRow to any variable: the string instance is still created and stays somewhere in memory until it is garbage collected. If you passed your strings by reference, it wouldn't make much difference. You would be able to reuse the memory location that stores the reference to the created string (you can think of it as the memory address of the string instance), but not the memory location of the string contents.
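
The difference between passing a string by value and by ref can be sketched as follows. ParameterDemo and its two methods are hypothetical names used only for illustration; in both cases only a reference is copied or overwritten, never the string's characters.

```csharp
using System;

public static class ParameterDemo
{
    // By-value parameter: reassigning it only changes the method's own copy of the reference.
    public static void AssignByValue(string target, string value)
    {
        target = value;
    }

    // By-ref parameter: reassigning it overwrites the caller's variable with the new reference.
    public static void AssignByRef(ref string target, string value)
    {
        target = value;
    }
}
```

After AssignByValue the caller's variable is unchanged. After AssignByRef it refers to the very same string object that was passed as value, so no extra string was created along the way.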

Memory assigned to strings

I know that strings in C# are immutable i.e. when I change the value of a string variable a new string variable with the same name is created with the new value and the older one is collected by GC. Am I right?
string s1 = "abc";
s1 = s1.Substring(0, 1);
If what I said is right, then my doubt is if a new string is created, then is it created in the same memory location?

if a new string is created, then is it created in the same memory location?
No, a separate string object is created, in a separate bit of memory.
You're then replacing the value of s1 with a reference to the newly-created string. That may or may not mean that the original string can be garbage collected - it depends on whether anything else has references to it. In the case of a string constant (as in your example, with a string literal) I suspect that won't be garbage collected anyway, although it's an implementation detail.
If you have:
string text = "original";
text = text.Substring(0, 5);
text = text.Substring(0, 3);
then the intermediate string created by the first call to Substring will be eligible for garbage collection, because nothing else refers to it. That doesn't mean it will be garbage collected immediately though, and it certainly doesn't mean that its memory will be reused for the string created by the final line.

String caching. Memory optimization and re-use

I am currently working on a very large legacy application which handles a large amount of string data gathered from various sources (i.e. names, identifiers, common codes relating to the business etc.). This data alone can take up to 200 MB of RAM in the application process.
A colleague of mine mentioned one possible strategy to reduce the memory footprint (as a lot of the individual strings are duplicate across the data sets), would be to "cache" the recurring strings in a dictionary and re-use them when required. So for example…
public class StringCacher
{
    private readonly Dictionary<string, string> _stringCache;
    public StringCacher()
    {
        _stringCache = new Dictionary<string, string>();
    }
    public string AddOrReuse(string stringToCache)
    {
        // Store the string the first time it is seen; afterwards return the stored instance.
        if (!_stringCache.ContainsKey(stringToCache))
            _stringCache[stringToCache] = stringToCache;
        return _stringCache[stringToCache];
    }
}
Then to use this caching...
public IEnumerable<string> IncomingData()
{
    var stringCache = new StringCacher();
    var dataList = new List<string>();
    // Add the data, a fair amount of the strings will be the same.
    dataList.Add(stringCache.AddOrReuse("AAAA"));
    dataList.Add(stringCache.AddOrReuse("BBBB"));
    dataList.Add(stringCache.AddOrReuse("AAAA"));
    dataList.Add(stringCache.AddOrReuse("CCCC"));
    dataList.Add(stringCache.AddOrReuse("AAAA"));
    return dataList;
}
As strings are immutable, and a lot of internal work is done by the framework to make them behave in a similar way to value types, I'm half thinking that this will just create a copy of each string in the dictionary and double the amount of memory used, rather than pass a reference to the string stored in the dictionary (which is what my colleague is assuming).
So taking into account that this will be run on a massive set of string data...
Is this going to save any memory, assuming that 30% of the string values will be used twice or more?
Is the assumption that this will even work correct?

This is essentially what string interning is, except you don't have to worry about how it works. In your example you are still creating a string, then comparing it, then leaving the copy to be garbage collected. .NET does this for you at run time for string literals, and String.Intern lets you opt in for other strings.
See also String.Intern and Optimizing C# String Performance (C Calvert)
If a new string is created with code like (String goober1 = "foo"; String goober2 = "foo";) shown in lines 18 and 19, then the intern table is checked. If your string is already in there, then both variables will point at the same block of memory maintained by the intern table.
So, you don't have to roll your own - it won't really provide any advantage. EDIT UNLESS: your strings don't usually live for as long as your AppDomain - interned strings live for the lifetime of the AppDomain, which is not necessarily great for GC. If you want short lived strings, then you want a pool. From String.Intern:
If you are trying to reduce the total amount of memory your application allocates, keep in mind that interning a string has two unwanted side effects. First, the memory allocated for interned String objects is not likely to be released until the common language runtime (CLR) terminates. The reason is that the CLR's reference to the interned String object can persist after your application, or even your application domain, terminates. ...
EDIT 2: Also see Jon Skeet's SO answer here

This is already built-in .NET, it's called String.Intern, no need to reinvent.

You can achieve this using the built-in .NET functionality.
When you initialise your string, make a call to string.Intern() with your string.
For example:
dataList.Add(string.Intern("AAAA"));
Every subsequent call with the same string will use the same reference in memory. So if you have 1000 AAAAs, only 1 copy of AAAA is stored in memory.
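
A small sketch of that behaviour, using strings built at run time so they are not already interned as literals. InternDemo and its methods are hypothetical names introduced for this example.

```csharp
using System;

public static class InternDemo
{
    // Builds a string at run time so it is not a compile-time literal.
    public static string Build(char c, int count)
    {
        return new string(c, count);
    }

    // True when interning both strings yields one shared object.
    public static bool SameObjectAfterIntern(string a, string b)
    {
        return object.ReferenceEquals(string.Intern(a), string.Intern(b));
    }
}
```

Two calls to InternDemo.Build('A', 4) give equal but distinct string objects, while string.Intern returns the same instance for both. The duplicates only become collectable once you keep the interned reference and drop the originals.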

Strings and Garbage Collection

I have heard conflicting stories on this topic and am looking for a little bit of clarity.
How would one dispose of a string object immediately, or at the very least clear traces of it?

That depends. Literal strings are interned by default, so even if your application no longer references one it will not be collected, as it is still referenced by the internal interning structure. Other strings are just like any other managed object: as soon as they are no longer referenced by your application they are eligible for garbage collection.
More about interning here in this question: Where do Java and .NET string literals reside?

If you need to protect a string and be able to dispose it when you want, use System.Security.SecureString class.
Protect sensitive data with .NET 2.0's SecureString class

I wrote a little extension method for the string class for situations like this, it's probably the only sure way of ensuring the string itself is unreadable until collected. Obviously only works on dynamically generated strings, not literals.
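// Requires compiling with /unsafe. Overwrites the string's characters in place,
// so every reference to this string instance sees the zeroed contents afterwards.
public unsafe static void Clear(this string s)
{
    fixed(char* ptr = s)
    {
        for(int i = 0; i < s.Length; i++)
        {
            ptr[i] = '\0';
        }
    }
}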

This is all down to the garbage collector to handle that for you. You can force it to run a clean-up by calling GC.Collect(). From the docs:
Use this method to try to reclaim all memory that is inaccessible. All objects, regardless of how long they have been in memory, are considered for collection; however, objects that are referenced in managed code are not collected. Use this method to force the system to try to reclaim the maximum amount of available memory.
That's the closest you'll get me thinks!!

I will answer this question from a security perspective.
If you want to destroy a string for security reasons, then it is probably because you don't want anyone snooping on your secret information, and you expect they might scan the memory, or find it in a page file or something if the computer is stolen or otherwise compromised.
The problem is that once a System.String is created in a managed application, there is not really a lot you can do about it. There may be some sneaky way of doing some unsafe reflection and overwriting the bytes, but I can't imagine that such things would be reliable.
The trick is to never put the info in a string at all.
I had this issue one time with a system that I developed for some company laptops. The hard drives were not encrypted, and I knew that if someone took a laptop, then they could easily scan it for sensitive info. I wanted to protect a password from such attacks.
The way I dealt with it is this: I put the password in a byte array by capturing key press events on the textbox control. The textbox never contained anything but asterisks and single characters. The password never existed as a string at any time. I then hashed the byte array and zeroed the original. The hash was then XORed with a random hard-coded key, and this was used to encrypt all the sensitive data.
After everything was encrypted, then the key was zeroed out.
Naturally, some of the data might exist in the page file as plaintext, and it's also possible that the final key could be inspected as well. But nobody was going to steal the password dang it!

There's no deterministic way to clear all traces of a string (System.String) from memory. Your only options are to use a character array or a SecureString object.
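
The character-array option might look like the sketch below. SecretBuffer and Wipe are hypothetical names, and the guarantee is limited: this only clears this one array, not any copies the runtime or other code may have made elsewhere.

```csharp
using System;

public static class SecretBuffer
{
    // Overwrites every character so the secret no longer sits in this array.
    public static void Wipe(char[] secret)
    {
        Array.Clear(secret, 0, secret.Length);
    }
}
```

The point of the design is that a char[] is mutable, so the caller decides when the contents disappear, whereas a System.String stays readable until the garbage collector reclaims it and the memory is overwritten.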

One of the best ways to limit the lifetime of string objects in memory is to declare them as local variables in the innermost scope possible and not as private member variables on a class.
It's a common mistake for junior developers to declare their strings 'private string ...' on the class itself.
I've also seen well-meaning experienced developers trying to cache some complex string concatenation (a+b+c+d...) in a private member variable so they don't have to keep calculating it. Big mistake - it takes hardly any time to recalculate it, the temporary strings are garbage collected almost immediately when the first generation of GC happens, and the memory swallowed by caching all those strings just took available memory away from more important items like cached database records or cached page output.

Set the string variable to null once you don't need it.
string s = "dispose me!";
...
...
s = null;
and then call GC.Collect() to invoke the garbage collector. Note, however, that the GC cannot guarantee the string will be collected immediately.
