C# string concatenation and string interning

When concatenating onto a string that is already in the intern pool, is a new string entered into the intern pool, or is a reference returned to the existing string in the pool? According to this article, String.Concat and StringBuilder will insert new string instances into the intern pool?
http://community.bartdesmet.net/blogs/bart/archive/2006/09/27/4472.aspx
Can anyone explain how concatenation works with the intern pool?

If you create new strings at run time, they are not automatically put into the intern pool. The exception is concatenation of compile-time constants: the compiler folds them into a single string literal, and that literal is interned when the code is loaded and JIT-compiled.

You can see whether a string has been interned by calling String.IsInterned. The call does not create a string: it returns either a reference to the interned string equal to the argument, or null if no such string has been interned.
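The behaviour described above can be sketched roughly as follows. The class and method names (InternCheck, BuildAtRuntime, IsThePooledInstance) are made up for illustration; only string.IsInterned and string.Intern are real APIs. The Guid suffix is an assumption used to make it practically certain that no equal string is already in the pool.

```csharp
using System;

public static class InternCheck
{
    // Builds a string at run time. The Guid suffix makes it practically
    // certain that no equal string already sits in the intern pool.
    public static string BuildAtRuntime(string prefix)
    {
        return prefix + Guid.NewGuid().ToString("N");
    }

    // True if the given instance is the very object held by the intern pool.
    public static bool IsThePooledInstance(string s)
    {
        return ReferenceEquals(string.IsInterned(s), s);
    }

    public static void Demo()
    {
        string built = BuildAtRuntime("key-");

        // Concatenation alone does not intern the result, so this is null.
        Console.WriteLine(string.IsInterned(built) == null);

        // Explicit interning adds the string to the pool and returns the pooled reference.
        string pooled = string.Intern(built);
        Console.WriteLine(IsThePooledInstance(pooled));
    }
}
```

Calling InternCheck.Demo() from a Main method should show that the concatenated string only appears in the pool after the explicit string.Intern call.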

Related

How many string objects are created when using string concatenation in C#

I'm a beginner in C# and have some questions about string concatenation.
string str = "My name is";
str += "John";
Q1-Does C# (.NET) have the same string pool concept as Java?
Q2-How many string objects are created?
Q1-Does C# (.NET) have the same string pool concept as Java?
Correction: my original answer was no (that strings in C# are not like the string pool in Java, and that each string is its own reference). I had to research this for Java. It is conceptually the same thing; I was mistaken about the details of Java's string pool.
C# commonly calls it string interning
You can read more about it here at Fabulous Adventures In Coding : Eric Lippert's Erstwhile Blog
String interning and String.Empty
If you have two identical string literals in one compilation unit then
the code we generate ensures that only one string object is created by
the CLR for all instances of that literal within the assembly. This
optimization is called "string interning".
String interning is a CLI feature that reuses a string instance in certain situations :
string literals, created via the ldstr IL command
When invoked explicitly using string.Intern
Q2-How many string objects are created?
Because strings in C# are immutable, you get 3 string allocations out of your 2 statements
// 1st string
string str = "My name is";
// 2nd string
// "John"
// 3rd string, which is the concatenation of the first 2
str += "John";
Yes, there is such a thing.
The common language runtime conserves string storage by maintaining a table, called the intern pool, that contains a single reference to each unique literal string declared or created programmatically in your program. Consequently, an instance of a literal string with a particular value only exists once in the system.
source
In your case, I believe there will be three allocations.

Optimizing string manipulation

It is 2019 and we have a banking project which uses mainframe as data store and transactions.
We are using DTOs (Commarea, plain C# classes) that are converted to a plain string (this is how the mainframe works) and then sent to the mainframe.
While converting a class to its string representation we use several string operations such as substring, pad left, pad right, trim etc.
As you can imagine, this causes several string allocations and hence garbage collection. It is usually at generation 0 but still.
Types like Decimal in particular, which is a packed type on the mainframe that fits into 8 bytes, create several strings.
I tried using ReadOnlySpan<char>, for example for substring. See example.
However, operations like PadRight and PadLeft are not available, because it is a read-only span.
Update:
To clarify a part of conversion happens as follows:
val.Trim().Substring(5).PadRight(10);
I know that this creates 3 strings. I know strings are immutable. My question is about doing the above operation with ReadOnlySpan or Memory.
I cannot use ReadOnlySpan only for the substring, because as soon as I call the ToString method I lose the benefits.
I have to call ToString only once, at the very end.
Is there another construct that supports operations beyond substring, so that I can actually add and remove data in the memory?
Thanks.
Using ReadOnlySpan can help reduce the number of string allocations in your code, but it won't eliminate them completely. This is because ReadOnlySpan is a read-only view of a sequence of characters, so you cannot modify the underlying data using a ReadOnlySpan.
To avoid unnecessary string allocations, you can use the string.AsSpan() method to get a ReadOnlySpan<char> view of a string, and then use ReadOnlySpan<char>.Slice() to get substrings without allocating new strings. For example, the following code takes a substring without allocating a new string:
string val = "Hello world";
ReadOnlySpan<char> span = val.AsSpan();
ReadOnlySpan<char> substring = span.Slice(5);
However, as mentioned earlier, you cannot use ReadOnlySpan to modify the underlying data, so string.PadRight and string.PadLeft will still allocate new strings. To avoid these intermediate allocations, you can use a StringBuilder to build up the string piece by piece, and then call ToString() on the StringBuilder when you're done. Note that StringBuilder has no PadRight or PadLeft method; padding is done by appending the fill character the required number of times.
In summary, using ReadOnlySpan can help reduce the number of string allocations in your code, but it won't eliminate them completely. To avoid allocating new strings for each string operation, you can use a StringBuilder to build up the final string piece by piece.
using System;
using System.Text;

string val = "Hello world";
// Trim and take the substring as a span; neither step allocates a string
ReadOnlySpan<char> slice = val.AsSpan().Trim().Slice(5);
StringBuilder builder = new StringBuilder(Math.Max(slice.Length, 10));
builder.Append(slice);
// Pad with spaces to the right, to make the result 10 characters long
if (builder.Length < 10)
    builder.Append(' ', 10 - builder.Length);
// Convert the final result to a regular string (the only string allocated)
string result = builder.ToString();
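If even the StringBuilder's internal buffer is unwanted, one possible alternative is to assemble the characters in a writable Span<char> and create the string once at the end. This is only a sketch, assuming .NET Core 2.1 or later (for MemoryExtensions.Trim and the string(ReadOnlySpan<char>) constructor); the names FixedWidth and TrimSliceAndPad are hypothetical, and the 256-character stack threshold is an arbitrary choice.

```csharp
using System;

public static class FixedWidth
{
    // Same result as val.Trim().Substring(start).PadRight(width),
    // but only the final string is allocated for typical field sizes.
    public static string TrimSliceAndPad(string val, int start, int width)
    {
        // Trim and Slice on a span are views over val; no string is created.
        ReadOnlySpan<char> slice = val.AsSpan().Trim().Slice(start);

        int length = Math.Max(slice.Length, width);

        // Small buffers live on the stack; larger ones fall back to a heap array.
        Span<char> buffer = length <= 256 ? stackalloc char[length] : new char[length];

        slice.CopyTo(buffer);
        buffer.Slice(slice.Length).Fill(' '); // right padding; a no-op if already wide enough

        return new string(buffer); // the single string allocation
    }
}
```

For example, FixedWidth.TrimSliceAndPad(val, 5, 10) is intended to give the same result as val.Trim().Substring(5).PadRight(10). Like Substring, Slice throws if start is past the end of the trimmed text.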

Why does `String.Trim()` not trim the object itself?

Not often, but sometimes, I need to use String.Trim() to remove whitespace from a string.
If it has been a while since I last wrote trimming code, I write:
string s = " text ";
s.Trim();
and am surprised that s has not changed. I need to write:
string s = " text ";
s = s.Trim();
Why are some string methods designed in this (not very intuitive) way? Is there something special with strings?
Strings are immutable. Any string operation generates a new string without changing the original string.
From MSDN:
Strings are immutable--the contents of a string object cannot be
changed after the object is created, although the syntax makes it
appear as if you can do this.
s.Trim() creates a new trimmed version of the original string and returns it instead of storing the new version in s. So, what you have to do is to store the trimmed instance in your variable:
s = s.Trim();
This pattern is followed by all the string methods and extension methods.
Immutability concerns how strings are kept in memory; it does not by itself dictate this pattern. These methods could have been designed to create the new modified string instance in memory and point the variable to the new instance.
It's also good to remember that if you need to make lots of modifications to a string, it's much better to use a StringBuilder, which behaves like a "mutable" string and is much more efficient for this kind of operation.
As it is written in MSDN Library:
A String object is called immutable (read-only), because its value
cannot be modified after it has been created. Methods that appear to
modify a String object actually return a new String object that
contains the modification.
Because strings are immutable, string manipulation routines that
perform repeated additions or deletions to what appears to be a single
string can exact a significant performance penalty.
See this link.
In addition to all the good answers, I also feel that part of the reason is thread safety.
Let's say
string s = " any text ";
s.Trim();
If Trim modified the string in place, there would be nothing stopping another thread from modifying s at the same time. If the same string were modified, say the other thread removed 'a' from s, what would the result of s.Trim() be?
But when Trim returns a new string, it can work on its own copy and return the modified string, regardless of what another thread does with s.

Why string.Replace("X","Y") works only when assigned to new string?

I guess it has something to do with string being a reference type, but I don't get why simply string.Replace("X","Y") does not work.
Why do I need to do string A = stringB.Replace("X","Y")? I thought it was just a method applied to the specified instance.
EDIT: Thank you so far. I extend my question: Why does b+="FFF" work but b.Replace does not?
Because strings are immutable. Any time you change a string, .NET creates a new string object. It's a property of the class.
Immutable objects
String Object
Why doesn't stringA.Replace("X","Y") work?
Why do I need to do stringB = stringA.Replace("X","Y"); ?
Because strings are immutable in .NET. You cannot change the value of an existing string object, you can only create new strings. string.Replace creates a new string which you can then assign to something if you wish to keep a reference to it. From the documentation:
Returns a new string in which all occurrences of a specified string in the current instance are replaced with another specified string.
Emphasis mine.
So if strings are immutable, why does b += "FFF"; work?
Good question.
First note that b += "FFF"; is equivalent to b = b + "FFF"; (except that b is only evaluated once).
The expression b + "FFF" creates a new string with the correct result without modifying the old string. The reference to the new string is then assigned to b replacing the reference to the old string. If there are no other references to the old string then it will become eligible for garbage collection.
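The rebinding described above can be made visible by keeping a second reference to the original string. This is a minimal sketch; the class and method names (AppendDemo, Append) are invented for the example.

```csharp
using System;

public static class AppendDemo
{
    // Returns the original string, the appended string, and whether
    // both variables still refer to the same object afterwards.
    public static (string Old, string New, bool SameObject) Append(string b)
    {
        string old = b;   // second reference to the same string object
        b += "FFF";       // b = b + "FFF": builds a new string and rebinds b
        return (old, b, ReferenceEquals(old, b));
    }
}
```

Calling AppendDemo.Append("abc") leaves the old reference pointing at an unchanged "abc", while b refers to a different object containing "abcFFF".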
Strings are immutable, which means that once they are created, they cannot be changed anymore. This has several reasons, as far as I know mainly for performance (how strings are represented in memory).
See also (among many):
http://en.wikipedia.org/wiki/Immutable_object
http://channel9.msdn.com/forums/TechOff/58729-Why-are-string-types-immutable-in-C/
As a direct consequence of that, each string operation creates a new string object. In particular, if you do things like
foreach (string msg in messages)
{
totalMessage = totalMessage + msg;
totalMessage = totalMessage + "\n";
}
you actually create potentially dozens or hundreds of string objects. So, if you need to do more extensive string manipulation, follow GvS's hint and use the StringBuilder.
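A StringBuilder version of the loop above might look like the following sketch. The names MessageJoiner and JoinLines are hypothetical; the loop appends into one growing buffer and creates a single string at the end.

```csharp
using System.Collections.Generic;
using System.Text;

public static class MessageJoiner
{
    // Appends every message followed by a newline into one buffer,
    // instead of creating a new string for each concatenation.
    public static string JoinLines(IEnumerable<string> messages)
    {
        var builder = new StringBuilder();
        foreach (string msg in messages)
        {
            builder.Append(msg).Append('\n');
        }
        return builder.ToString(); // the only string created for the result
    }
}
```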
Strings are immutable. Any operation changing them has to create a new string.
A StringBuilder supports the inline Replace method.
Use the StringBuilder if you need to do a lot of string manipulation.
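As a rough illustration of the in-place Replace mentioned above (the wrapper names InPlaceReplace and ReplaceInBuilder are made up for this sketch):

```csharp
using System.Text;

public static class InPlaceReplace
{
    public static string ReplaceInBuilder(string text, string oldValue, string newValue)
    {
        var builder = new StringBuilder(text);

        // The return value is ignored here, yet the builder's content changes,
        // because StringBuilder.Replace modifies the instance itself.
        builder.Replace(oldValue, newValue);

        return builder.ToString();
    }
}
```

Contrast this with string.Replace, whose result is lost unless it is assigned to a variable.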
Why does b += "FFF" work but b.Replace does not?
Because the += operator assigns the results back to the left hand operand, of course. It's just a short hand for b = b + "FFF";.
The simple fact is that you can't change any string in .Net. There are no instance methods for strings that alter the content of that string - you must always assign the results of an operation back to a string reference somewhere.
Yes, it's a method of System.String. But you can try
a = a.Replace("X","Y");
String.Replace is a method of the string class that returns a new string. It does not modify the current object. b.Replace("a","b") on its own would be similar to a line that only has c+1. So just as c=c+1 actually sets the value of c+1 to c, b=b.Replace("a","b") sets b to the new string that was returned.
As everyone above had said, strings are immutable.
This means that when you do your replace, you get a new string, rather than changing the existing string.
If you don't store this new string in a variable (such as in the variable that it was declared as) your new string won't be saved anywhere.
To answer your extended question, b+="FFF" is equivalent to b = b + "FFF", so basically you are creating a new string here also.
Just to be more explicit. string.Replace("X","Y") returns a new string...but since you are not assigning the new string to anything the new string is lost.

Is it because of string pooling by CLR or by the GetHashCode() method?

Is it because of string pooling by CLR or by the GetHashCode() method of both strings return same value?
string s1 = "xyz";
string s2 = "xyz";
Console.WriteLine(" s1 reference equals s2 : {0}", object.ReferenceEquals(s1, s2));
Console writes : "s1 reference equals s2 : True"
I believe it's not because GetHashCode() returns the same value for both string instances. I tested with a custom object and overrode the GetHashCode() method to return a single constant every time. Two separate instances of this object are still not reference-equal.
Please let me know, what is happening behind the scene.
thanks
123Developer
It sounds like string interning - a method of storing only one copy of a string. It requires strings to be an immutable type in the language you are dealing with, and .Net satisfies that and uses string interning.
In string interning a string "xyz" is stored in the intern pool, and whenever you say "xyz" internally it references the entry in the pool. This can save space by only storing the string once. So a comparison of "xyz" == "xyz" will get interpreted as [pointer to 34576] == [pointer to 34576] which is true.
This is definitely due to string interning. Hash codes are never calculated when comparing references with object.ReferenceEquals.
From the C# spec, section 2.4.4.5:
Each string literal does not
necessarily result in a new string
instance. When two or more string
literals that are equivalent according
to the string equality operator
(§7.9.7) appear in the same program,
these string literals refer to the
same string instance.
Note that string constant expressions count as literals in this case, so:
string x = "a" + "b";
string y = "ab";
It's guaranteed that x and y refer to the same object too (i.e. they are the same references).
When the spec says "program" by the way, it really means "assembly". The behaviour of equal strings in different assemblies depends on things like CompilationRelaxations.NoStringInterning and the precise CLR implementation and execution time situation (e.g. whether the assembly is ngen'd or not).
It's similar to string pooling, but it's not done at runtime but at compile time.
Any string literal in an assembly only exists once. The compiler uses the same constant string for all occurrences of the string literal "xyz". As strings are immutable (you can never change the value of a string instance), the compiler can safely use the same string instance for separate string references.
If you instead create a string at runtime, you get a separate instance:
string s1 = "xyz";
string s2 = "xy";
s2 += "z";
Console.WriteLine("s1 ref = s2 : {0}", object.ReferenceEquals(s1, s2));
Output:
s1 ref = s2 : False
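For comparison, a string built at run time can be mapped back to the pooled instance with an explicit string.Intern call. The following is a sketch under the assumption of an ordinary JIT-compiled assembly, where the literal "xyz" is interned when it is loaded; the names PoolDemo and Compare are invented for the example.

```csharp
using System;

public static class PoolDemo
{
    // Compares a run-time concatenation with the literal "xyz",
    // first directly and then after explicit interning.
    public static (bool BeforeIntern, bool AfterIntern) Compare(string left, string right)
    {
        string literal = "xyz";        // loaded via ldstr, so it comes from the intern pool
        string built = left + right;   // fresh object created by the concatenation

        bool before = ReferenceEquals(literal, built);

        // Intern looks the value up in the pool and returns the pooled reference.
        bool after = ReferenceEquals(literal, string.Intern(built));

        return (before, after);
    }
}
```

With PoolDemo.Compare("xy", "z"), the first comparison should be false and the second true, since string.Intern returns the instance that the literal already refers to.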
Totally agree with Tom's answer...
Excerpt from CIL Specification (page 126):
The CLI guarantees that the result of
two ldstr instructions referring to
two metadata tokens that have the same
sequence of characters, return
precisely the same string object (a
process known as “string interning”).
string interning has nothing to do with it.
I would be very surprised to find out that the .NET/C# compiler calls Intern implicitly. It puts too much stress on the CPU to check for a matching string at runtime.
