C# string pool reference

If I have two strings with the same value, they should have the same reference, right?
Here is my case:
string s1 = "aaa";
string s2 = "aaa";
Console.WriteLine(" s1: {0}; s2: {1}; equals: {2}", s1,s2, ReferenceEquals(s1, s2));
prints: s1: aaa; s2: aaa; equals: True
But take a look at this code:
string s1 = "aaa";
string s2 = new string(s1.ToCharArray());
Console.WriteLine(" s1: {0}; s2: {1}; equals: {2}", s1,s2, ReferenceEquals(s1, s2));
prints: s1: aaa; s2: aaa; equals: False
Why does ReferenceEquals return false in the second case?

I've found an answer: only string literals are automatically stored in the intern pool.
Interning literal strings is cheap at runtime and saves memory. Interning non-literal strings is expensive at runtime and therefore saves a tiny amount of memory in exchange for making the common cases much slower.
The cost of the interning-strings-at-runtime "optimization" does not pay for the benefit, and is therefore not actually an optimization. The cost of interning literal strings is cheap and therefore does pay for the benefit.
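A minimal sketch of that distinction, assuming a default .NET runtime where string literals are interned: a string built at runtime is a separate object until it is explicitly passed through String.Intern. The class name InternDemo is made up for illustration.

```csharp
using System;

public static class InternDemo
{
    public static void Main()
    {
        string s1 = "aaa";                          // literal: interned automatically
        string s2 = new string(s1.ToCharArray());   // built at runtime: a new object

        // Same value, different objects.
        Console.WriteLine(s1 == s2);                        // True
        Console.WriteLine(object.ReferenceEquals(s1, s2));  // False

        // String.Intern returns the pooled instance that has the same value.
        string s3 = string.Intern(s2);
        Console.WriteLine(object.ReferenceEquals(s1, s3));  // True
    }
}
```

Interning on demand like this keeps the pooled string alive for the lifetime of the process, so it is usually only worth it for a small set of values that are compared very often.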

Related

Why is string interning failing here (or is it)?

string s1 = "test";
string s5 = s1.Substring(0, 3)+"t";
string s6 = s1.Substring(0,4)+"";
Console.WriteLine("{0} ", object.ReferenceEquals(s1, s5)); //False
Console.WriteLine("{0} ", object.ReferenceEquals(s1, s6)); //True
Both strings s5 and s6 have the same value as s1 ("test"). Based on the concept of string interning, both statements should have evaluated to true. Can someone please explain why s5 didn't have the same reference as s1?
You should get false for calls of ReferenceEquals on string objects that are not string literals.
Essentially, the last line prints True by coincidence: what happens is that when you pass an empty string for string concatenation, library optimization recognizes this, and returns the original string. This has nothing to do with interning, as the same thing will happen with strings that you read from console or construct in any other way:
var s1 = Console.ReadLine();
var s2 = s1+"";
var s3 = ""+s1;
Console.WriteLine(
"{0} {1} {2}"
, object.ReferenceEquals(s1, s2)
, object.ReferenceEquals(s1, s3)
, object.ReferenceEquals(s2, s3)
);
The above prints
True True True
The CLR doesn't intern all strings. All string literals are interned by default. The following, however:
Console.WriteLine("{0} ", object.ReferenceEquals(s1, s6)); //True
Returns true, since the line here:
string s6 = s1.Substring(0,4)+"";
Is effectively optimized to return the same reference back. It happens to (likely) be interned, but that's coincidental. If you want to see whether a string is interned, you should use String.IsInterned().
If you want to intern strings at runtime, you can use String.Intern and store the reference, as per the MSDN documentation here: String.Intern Method (String). However, I strongly suggest you not use this method, unless you have a good reason to do so: it has performance considerations and potentially unwanted side-effects (for example, strings that have been interned cannot be garbage collected).
From msdn documentation of object.ReferenceEquals here:
When comparing strings. If objA and objB are strings, the ReferenceEquals method returns true if the string is interned. It does not perform a test for value equality. In the following example, s1 and s2 are equal because they are two instances of a single interned string. However, s3 and s4 are not equal, because although they have identical string values, that string is not interned.
using System;

public class Example
{
    public static void Main()
    {
        String s1 = "String1";
        String s2 = "String1";
        Console.WriteLine("s1 = s2: {0}", Object.ReferenceEquals(s1, s2));
        Console.WriteLine("{0} interned: {1}", s1,
                          String.IsNullOrEmpty(String.IsInterned(s1)) ? "No" : "Yes");

        String suffix = "A";
        String s3 = "String" + suffix;
        String s4 = "String" + suffix;
        Console.WriteLine("s3 = s4: {0}", Object.ReferenceEquals(s3, s4));
        Console.WriteLine("{0} interned: {1}", s3,
                          String.IsNullOrEmpty(String.IsInterned(s3)) ? "No" : "Yes");
    }
}
// The example displays the following output:
// s1 = s2: True
// String1 interned: Yes
// s3 = s4: False
// StringA interned: No
Strings in .NET can be interned. It isn't said anywhere that 2 identical strings should be the same string instance. Typically, the compiler will intern identical string literals, but this isn't true for all strings, and is certainly not true of strings created dynamically at runtime.
The Substring method is smart enough to return the original string in the case where the substring being requested is exactly the original string. Link to the Reference Source found in comment by @DanielA.White. So s1.Substring(0,4) returns s1 when s1 is of length 4. And apparently the + operator has a similar optimization such that
string s6 = s1.Substring(0,4)+"";
is functionally equivalent to:
string s6 = s1;

What is the .NET string structure, and how does .NET store and retrieve it?

Consider the following code:
string a = "This is not a long string!";
string b = "Another string";
b = "This is" + " not a long " + "string" + "!";
Console.WriteLine(object.ReferenceEquals(a, b)); //True !
string c = "This is" + " not a long " + "string" + '!';
Console.WriteLine(object.ReferenceEquals(a, c)); //False
The only explanation I can see is that .NET has optimized the variables to take less space.
Does .NET store strings null-terminated, or with an explicit length?
I mean, when I write the following code, is it possible to lose the part after the null char if .NET runs an optimization on the string?
string Waaaa = "This is not \0a long string!";
Strings in .NET are in essence character arrays - char[] (where each char is a UTF-16 code unit). They are not C strings: they are not null terminated, and the string object stores its own length.
What you see is the result of string interning - any string literal will get interned (and the compiler is smart enough to know to convert concatenated strings to a single literal).
Your Waaaa variable will be exactly what you have posted - with a null character in the middle.
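As a quick check of that last point, the following sketch (the class name NullCharDemo is just for illustration) shows that the embedded null character is counted like any other character and that nothing after it is lost:

```csharp
using System;

public static class NullCharDemo
{
    public static void Main()
    {
        string waaaa = "This is not \0a long string!";

        // The length counts every UTF-16 code unit, including the embedded '\0'.
        Console.WriteLine(waaaa.Length);            // 27
        Console.WriteLine(waaaa.IndexOf('\0'));     // 12

        // Everything after the null character is still part of the string.
        Console.WriteLine(waaaa.Substring(13));     // a long string!
    }
}
```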

Is it faster to compare strings with Regex with IgnoreCase or with ToLower method of string?

Given strings like these:
string s1 = "Abc";
string s2 = "ABC";
What is faster:
Regex.Match(s1, s2, RegexOptions.IgnoreCase)
or
s1.ToLower() == s2.ToLower()
If they are the same, or if one is faster than the other, when is it better to use one over the other?
Probably the second is faster, but I'd avoid both those approaches.
Better is to use the method string.Equals with the appropriate StringComparison argument:
s1.Equals(s2, StringComparison.CurrentCultureIgnoreCase)
See it working online: ideone
Theoretically speaking, comparing 2 strings should be faster; regexes are known to be rather slow.
However, if you want to match a string s1 to a RegEx s2 while ignoring case (This is not the same as comparing 2 strings), then the first solution is better as it should avoid creating another string.
As always with this kind of questions, I would run a benchmark and compare both performances :)
@Mark Byers has already posted the right answer.
I want to stress that you should never use ToLower for string comparison. It is incorrect.
s1.Equals(s2, StringComparison.CurrentCultureIgnoreCase) //#1
s1.ToLower() == s2.ToLower() //#2
s1.ToLowerInvariant() == s2.ToLowerInvariant() //#3
(2) and (3) are both incorrect when it comes to exotic languages and strange characters. The Turkish "I" is the classical example.
Always use #1, even in Hashtables
(except for very special circumstances)
It should be noted that Regex.Match(s1, s2, RegexOptions.IgnoreCase) is not a safe way to check for case-insensitive equality in the general case. Consider the case where s2 is ".*". The match will always succeed no matter what s1 is!
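A small sketch of why a regex match is not an equality test. The inputs and the class name RegexEqualityPitfall are made up for illustration, and StringComparison.OrdinalIgnoreCase is used here only to keep the output independent of the current culture:

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexEqualityPitfall
{
    public static void Main()
    {
        // A pattern of ".*" matches any input, so this says nothing about equality.
        Console.WriteLine(Regex.IsMatch("anything", ".*", RegexOptions.IgnoreCase));  // True

        // An unanchored pattern also matches inside a longer string.
        Console.WriteLine(Regex.IsMatch("xxAbcxx", "ABC", RegexOptions.IgnoreCase));  // True

        // string.Equals compares the whole strings and treats both as plain text.
        Console.WriteLine(string.Equals("xxAbcxx", "ABC", StringComparison.OrdinalIgnoreCase));  // False
        Console.WriteLine(string.Equals("Abc", "ABC", StringComparison.OrdinalIgnoreCase));      // True
    }
}
```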
This may be the most extreme case of premature optimization I've ever seen. Trust me, you will never run into a situation where this issue will be relevant.
And don't listen to all those people who tell you to avoid regexes because "they're slow". Badly written regexes can indeed hog resources something awful, but that's the fault of whoever wrote the regex. Reasonably well-crafted regexes are plenty fast enough for the vast majority of tasks people apply them to.
Here is a small comparison of the 3 methods proposed:
Regex: 282ms
ToLower: 67ms
Equals: 34ms
public static void RunSnippet()
{
string s1 = "Abc";
string s2 = "ABC";
// Preload
compareUsingRegex(s1, s2);
compareUsingToLower(s1, s2);
compareUsingEquals(s1, s2);
// Regex
Stopwatch swRegex = Stopwatch.StartNew();
for (int i = 0; i < 300000; i++)
compareUsingRegex(s1, s2);
Console.WriteLine(string.Format("Regex: {0} ms", swRegex.ElapsedMilliseconds));
// ToLower
Stopwatch swToLower = Stopwatch.StartNew();
for (int i = 0; i < 300000; i++)
compareUsingToLower(s1, s2);
Console.WriteLine(string.Format("ToLower: {0} ms", swToLower.ElapsedMilliseconds));
// Equals
Stopwatch swEquals = Stopwatch.StartNew();
for (int i = 0; i < 300000; i++)
compareUsingEquals(s1, s2);
Console.WriteLine(string.Format("Equals: {0} ms", swEquals.ElapsedMilliseconds));
}
private static bool compareUsingRegex(string s1, string s2)
{
return Regex.IsMatch(s1, s2, RegexOptions.IgnoreCase);
}
private static bool compareUsingToLower(string s1, string s2)
{
return s1.ToLower() == s2.ToLower();
}
private static bool compareUsingEquals(string s1, string s2)
{
return s1.Equals(s2, StringComparison.CurrentCultureIgnoreCase);
}
Comparing will be faster, but instead of converting to lower or upper case and then comparing, it's better to use an equality comparison that can be made case-insensitive, e.g.:
s1.Equals(s2, StringComparison.OrdinalIgnoreCase)

c# string formatting

I'm curious why I would use string formatting when I can use concatenation, such as:
Console.WriteLine("Hello {0} !", name);
Console.WriteLine("Hello "+ name + " !");
Why prefer the first one over the second?
You picked too simple of an example.
String formatting:
allows you to use the same variable multiple times: ("{0} + {0} = {1}", x, 2*x)
automatically calls ToString on its arguments: ("{0}: {1}", someKeyObj, someValueObj)
allows you to specify formatting: ("The value will be {0:N3} (or {1:P}) on {2:MMMM yyyy gg}", x, y, theDate)
allows you to set padding easily: (">{0,3}<", "hi"); // "> hi<" (use {0,-3} for ">hi <")
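The points above can be sketched as follows. The values and the class name FormatDemo are made up for illustration, and CultureInfo.InvariantCulture is passed only so that the output does not depend on the machine's regional settings:

```csharp
using System;
using System.Globalization;

public static class FormatDemo
{
    public static void Main()
    {
        var inv = CultureInfo.InvariantCulture;

        // Reuse the same argument more than once.
        Console.WriteLine(string.Format(inv, "{0} + {0} = {1}", 3, 2 * 3));   // 3 + 3 = 6

        // Format specifiers: three decimals with group separators, and hex.
        Console.WriteLine(string.Format(inv, "{0:N3}", 1234.5));              // 1,234.500
        Console.WriteLine(string.Format(inv, "{0:X}", 255));                  // FF

        // Alignment: a positive width pads on the left, a negative one on the right.
        Console.WriteLine(string.Format(inv, ">{0,3}<", "hi"));               // > hi<
        Console.WriteLine(string.Format(inv, ">{0,-3}<", "hi"));              // >hi <
    }
}
```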
You can trade the string for a dynamic string later.
For example:
// In a land far, far away
string nameFormat = "{0} {1}";
// In our function
string firstName = "John";
string lastName = "Doe";
Console.WriteLine(nameFormat, firstName, lastName);
Here, you can change nameFormat to e.g. "{1}, {0}" and you don't need to change any of your other code. With concatenation, you would need to edit your code or possibly duplicate your code to handle both cases.
This is useful in localization/internationalization.
There isn't a singular correct answer to this question. There are a few issues you want to address:
Performance
The performance differences in your examples (and in real apps) are minimal. If you start writing MANY concatenations, you will gradually see better memory performance with the formatted string. Refer to Ben's answer
Readability
You will be better off with a formatted string when you have formatting, or have many different variables to stringify:
string formatString = "Hello {0}, today is {1:yyyy-MM-dd}";
Console.WriteLine(formatString, userName, DateTime.Today);
Extensibility
Your situation will determine what's best. You tell me which is better when you need to add an item between Username and Time in the log:
Console.WriteLine(
@"Error!
Username: " + userName + @"
Time: " + time.ToString("HH:mm:ss") + @"
Function: " + functionName + @"
Action: " + actionName + @"
status: " + status + @"
---");
or
Console.WriteLine(@"Error!
Username: {0}
Time: {1}
Function: {2}
Action: {3}
status: {4}
---",
userName, time.ToString("HH:mm:ss"), functionName, actionName, status);
Conclusion
I would choose the formatted string most of the time... But I wouldn't hesitate at all to use concatenation when it was easier.
I think the main thing here is readability, so I always choose whatever is most readable in each case.
Note:
With string interpolation of C# 6 your code could be simplified to this:
Console.WriteLine($"Hello {name}!");
Which I think is better than both of your suggested options.
String formatting allows you to keep the format string separate, and use it where it's needed properly without having to worry about concatenation.
string greeting = "Hello {0}!";
Console.WriteLine(greeting, name);
As for why you would use it in the exact example you gave... force of habit, really.
I think a good example involves i18n and l10n.
If you have to translate a string into different languages, this: "bla " + variable + "bla bla.."
will cause problems for a program used to create substitutions for your strings in a different language,
while this: "bla {0} blabla" is easily convertible (you will get {0} as part of the string).
Formatting is usually preferred for most of the reasons explained by other members here. There are couple more reasons I want to throw in from my short programming experience:
Formatting will help in generating Culture aware strings.
Formatting avoids the temporary intermediate strings that piecewise concatenation creates (for example, += across several statements or in a loop). Note that a single expression such as a + b + c compiles to one String.Concat call and creates no intermediates. If you have a bunch of strings you need to concatenate, you are better off using String.Join or StringBuilder.
You are using a trivial example where there is not much of a difference. However, if you have a long string with many parameters it is much nicer to be able to use a format string instead of many, many + signs and line breaks.
It also allows you to format numerical data as you wish, i.e., currency, dates, etc.
The first format is recommended. It allows you to specify specific formats for other types, like displaying a hex value, or displaying some specific string format.
e.g.
string displayInHex = String.Format("{0,10:X}", value); // to display in hex
It is also more consistent. You can use the same convention to display your Debug statement.
e.g.
Debug.WriteLine (String.Format("{0,10:X}", value));
Last but not least, it helps in the localisation of your program.
In addition to reasons like Ignacio's, many (including me) find String.Format-based code much easier to read and alter.
string x = String.Format(
    "This file was last modified on {0} at {1} by {2}, and was last backed up {3}.",
    date, time, user, backupDate);
vs.
string x = "This file was last modified on " + date + " at "
+ time + " by " + user + " and was last backed up " + backupDate + ".";
I have found the former approach (using string.format) very useful when overriding ToString() methods in Entity classes.
For example, in my Product class;
public override string ToString()
{
    return string.Format("{0} : {1} ({2} / {3})",
        this.id,
        this.description,
        this.brand,
        this.model);
}
Then, when users decide they want the product to appear differently it's easy to change the order/contents or layout of the string that is returned.
Of course you could still concatenate this string together, but I feel string.Format makes the whole thing a bit more readable and maintainable.
I guess that's the short answer I'm giving then isn't it - readability and maintainability for lengthy or complex strings.
Since the release of C# 6 a few years ago, it is also possible to use string interpolation. For example, this:
var n = 12.5847;
System.Console.WriteLine(string.Format("Hello World! {0:C2}", n));
Becomes that:
var n = 12.5847;
System.Console.WriteLine($"Hello World! {n:C2}");
And both give you this result:
Hello World! £12.58

Difference in String concatenation performance

I know you should use a StringBuilder when concatenating strings but I was just wondering if there is a difference in concatenating string variables and string literals. So, is there a difference in performance in building s1, s2, and s3?
string foo = "foo";
string bar = "bar";
string s1 = "foo" + "bar";
string s2 = foo + "bar";
string s3 = foo + bar;
In the case you present, it's actually better to use the concatenation operator on the string class. This is because it can pre-compute the lengths of the strings and allocate the buffer once and do a fast copy of the memory into the new string buffer.
And this is the general rule for concatenating strings. When you have a set number of items that you want to concatenate together (be it 2, or 2000, etc) it's better to just concatenate them all with the concatenation operator like so:
string result = s1 + s2 + ... + sn;
It should be noted in your specific case for s1:
string s1 = "foo" + "bar";
The compiler sees that it can optimize the concatenation of string literals here and transforms the above into this:
string s1 = "foobar";
Note, this is only for the concatenation of two string literals together. So if you were to do this:
string s2 = foo + "a" + bar;
Then it does nothing special (but it still makes a call to Concat and precomputes the length). However, in this case:
string s2 = foo + "a" + "nother" + bar;
The compiler will translate that into:
string s2 = foo + "another" + bar;
If the number of strings that you are concatenating is variable (as in a loop where you don't know beforehand how many elements there are), then StringBuilder is the most efficient way of concatenating those strings, since otherwise you would have to reallocate the buffer every time to account for the new string entries being added (of which you don't know how many are left).
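A minimal sketch of the variable-count case. The helper JoinWithBuilder and the sample data are hypothetical; string.Concat over a sequence is shown next to it as a built-in alternative that gives the same result:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class JoinDemo
{
    // Appends an unknown number of items into one growing buffer.
    public static string JoinWithBuilder(IEnumerable<string> parts)
    {
        var sb = new StringBuilder();
        foreach (string part in parts)
        {
            sb.Append(part);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        var parts = new List<string> { "foo", "bar", "baz" };
        Console.WriteLine(JoinWithBuilder(parts));   // foobarbaz
        Console.WriteLine(string.Concat(parts));     // foobarbaz
    }
}
```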
The compiler can concatenate literals at compile time, so "foo" + "bar" get compiled to "foobar" directly, and there's no need to do anything at runtime.
Other than that, I doubt there's any significant difference.
Your "knowledge" is incorrect. You should sometimes use a StringBuilder when concatenating strings. In particular, you should do it when you can't perform the concatenation all in one expression.
In this case, the code is compiled as:
string foo = "foo";
string bar = "bar";
string s1 = "foobar";
string s2 = String.Concat(foo, "bar");
string s3 = String.Concat(foo, bar);
Using a StringBuilder would make any of this less efficient - in particular it would push the concatenation for s1 from compile time to execution time. For s2 and s3 it would force the creation of an extra object (the StringBuilder) as well as probably allocating a string which is unnecessarily large.
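The difference between compile-time and run-time concatenation can be observed through reference equality. This is only a sketch: the helper AppendBar is hypothetical, and the results assume the usual behaviour described above, where literals are interned and String.Concat allocates a new string when neither operand is empty.

```csharp
using System;

public static class ConcatDemo
{
    // Taking foo as a parameter keeps the compiler from treating it as a constant.
    public static string AppendBar(string foo)
    {
        return foo + "bar";   // compiled to String.Concat(foo, "bar")
    }

    public static void Main()
    {
        string s1 = "foo" + "bar";      // folded to the literal "foobar" at compile time
        string s2 = AppendBar("foo");   // built at runtime

        Console.WriteLine(object.ReferenceEquals(s1, "foobar"));   // True
        Console.WriteLine(object.ReferenceEquals(s2, "foobar"));   // False
        Console.WriteLine(s1 == s2);                               // True
    }
}
```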
I have an article which goes into more detail on this.
There is no difference between s2 and s3. The compiler will take care of s1 for you, and concatenate it during compile time.
I'd say the compiler should decide this, because all of your string building can be optimized when the values are already known.
I guess StringBuilder pre-allocates space for appending more strings. As you know, + is a binary operator, so on its own it can only combine two strings at a time; evaluated naively, s4 = s1 + s2 + s3 would first build the intermediate string (s1 + s2) and only then s4. The C# compiler, however, compiles such an expression to a single String.Concat(s1, s2, s3) call, so no intermediate string is built.
