Why is String.Concat not optimized to StringBuilder.Append? - c#

I found that concatenations of constant string expressions are optimized by the compiler into one string.
Now, with concatenation of strings known only at run-time, why does the compiler not optimize string concatenation in loops, and concatenations of, say, more than 10 strings, to use StringBuilder.Append instead? I mean, it's possible, right? Instantiate a StringBuilder and turn each concatenation into an Append() call.
Is there any reason why this should or could not be optimized? What am I missing?

The definitive answer will have to come from the compiler design team. But let me take a stab here...
If your question is, why the compiler doesn't turn this:
string s = "";
for (int i = 0; i < 100; i++)
    s = string.Concat(s, i.ToString());
into this:
StringBuilder sb = new StringBuilder();
for (int i = 0; i < 100; i++)
    sb.Append(i.ToString());
string s = sb.ToString();
The most likely answer is that this is not an optimization. This is a rewrite of the code that introduces new constructs based on knowledge and intent that the developer has - not the compiler.
This type of change would require the compiler to have more knowledge of the BCL than is appropriate. What if tomorrow, some more optimal string assembly service becomes available? Should the compiler use that?
What if your loop conditions were more complicated, should the compiler attempt to perform some static analysis to decide whether the result of such a rewrite would still be functionally equivalent? In many ways, this would be like solving the halting problem.
Finally, I'm not sure that in all cases this would even result in faster code. There is a cost to instantiating a StringBuilder and to resizing its internal buffer as text is appended. In fact, the cost of appending is strongly tied to the size of the strings being concatenated, how many there are, and what memory pressure looks like - things the compiler cannot predict in advance.
It's your job as a developer to write well-performing code. The compiler can only help by making certain safe, invariant-preserving optimizations - not by rewriting your code for you.
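(If you want to see those costs for yourself, here is a minimal sketch of a micro-benchmark; the iteration count and names are mine, not from the answer above:)
using System;
using System.Diagnostics;
using System.Text;

class ConcatVsBuilder
{
    static void Main()
    {
        const int N = 20000; // illustrative size

        var sw = Stopwatch.StartNew();
        string s = "";
        for (int i = 0; i < N; i++)
            s = string.Concat(s, i.ToString()); // copies the whole string each pass: O(n^2) overall
        sw.Stop();
        Console.WriteLine("Concat:        " + sw.ElapsedMilliseconds + " ms");

        sw = Stopwatch.StartNew();
        var sb = new StringBuilder();
        for (int i = 0; i < N; i++)
            sb.Append(i); // amortized O(n) overall, despite occasional buffer growth
        string s2 = sb.ToString();
        sw.Stop();
        Console.WriteLine("StringBuilder: " + sw.ElapsedMilliseconds + " ms");
    }
}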

LBuskin's answer is excellent; I have just a couple of things to add.
First, JScript.NET does do this optimization. JScript is frequently used by less-experienced programmers for tasks that involve construction of large strings in loops, like building up JSON objects, HTML data, and so on.
Since those programmers might not be aware of the O(n²) cost of naive string concatenation, might not be aware of the existence of string builders, and frequently write code using this pattern, we felt that it was reasonable to put this optimization into JScript.NET.
C# programmers tend to be more aware of the underlying costs of the code they write and more aware of the existence of off-the-shelf parts like StringBuilder, so they need this optimization less. And more fundamentally, the design philosophy of C# is that it is a "do what I said" language with a minimum of "magic"; JScript is a "do what I mean" language that does its best to figure out how to best serve you, even if that means sometimes guessing wrong. Both philosophies are valid and useful.
Sometimes it does "go the other way". Compare this choice to the choice we make for switches on strings. Switches on strings are actually compiled as a creation of a dictionary containing the strings, rather than as a series of string comparisons. That optimization could be bad; it might be faster to simply do the string comparisons. But here we make a guess that you "meant" the switch to be a table lookup rather than a series of "if" statements -- if you'd meant the series of if statements, you could easily write that yourself.
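To make the string-switch example concrete, here is a sketch (the method and case values are made up): given enough cases, the compiler may compile this as a hashtable lookup on the string rather than as a cascade of comparisons.
static int Classify(string color)
{
    // With many cases, the compiler may emit a dictionary-backed jump
    // table here instead of sequential string comparisons.
    switch (color)
    {
        case "red":     return 0;
        case "green":   return 1;
        case "blue":    return 2;
        case "cyan":    return 3;
        case "magenta": return 4;
        case "yellow":  return 5;
        default:        return -1;
    }
}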

For a single concatenation of multiple strings (e.g. a + b + c + d + e + f + g + h + i + j) you really want to be using String.Concat, IMO. It has the overhead of building an array for each call, but it has the benefit that the method can work out the exact length of the resulting string before it needs to allocate any memory. StringBuilder.Append(a).Append(b)... gives the builder only a single value at a time, so it doesn't know how much memory to allocate up front.
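A sketch of the two shapes (the class and method names here are mine):
using System.Text;

static class ConcatShapes
{
    // One call: Concat can compute the exact final length from all ten
    // arguments up front and allocate the result string exactly once.
    static string ViaConcat(string a, string b, string c, string d, string e,
                            string f, string g, string h, string i, string j)
    {
        return string.Concat(a, b, c, d, e, f, g, h, i, j);
    }

    // Chained appends: the builder sees one piece at a time and may have
    // to grow (reallocate and copy) its internal buffer along the way.
    static string ViaBuilder(string a, string b, string c, string d, string e,
                             string f, string g, string h, string i, string j)
    {
        return new StringBuilder()
            .Append(a).Append(b).Append(c).Append(d).Append(e)
            .Append(f).Append(g).Append(h).Append(i).Append(j)
            .ToString();
    }
}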
As for doing it in loops: at that point you've introduced a new local variable, and you've got to write back to the string variable at exactly the right time (by calling StringBuilder.ToString()). What happens when you're running in the debugger? Wouldn't it be pretty confusing not to see the value building up, only for it to become visible at the end of the loop? Oh, and of course the compiler would have to verify that the value isn't used at any point before the end of the loop...

Two reasons:
You can't programmatically identify places where it would be strictly higher performing.
The "optimization" will slow things down if performed incorrectly.
You can suggest people use the correct calls for their application, but at some point it's the developer's responsibility to get it right.
Edit: Regarding the cutoff, we have another couple of problems:
The only way to know for sure that the cutoff is reached is complicated flow analysis, and the number of places where it would find convertible sections is extremely small.
Flow analysis is expensive. If you do it at run-time, the whole program runs slower for the rare chance that one piece of poorly written code becomes faster. If you do it at compile time, it isn't an error according to the language syntax, but you can issue a warning - and that's exactly what FxCop does (a slow, but available, flow analysis tool). Just imagine if FxCop always had to run with the compiler: so many hours people would spend just waiting to run their code. And if it happened at run-time, well, welcome to JVM start-up times...

Because it's the compiler's job to generate semantically-correct code. Changing invocations of String.Concat to invocations of StringBuilder.Append would be changing the semantics of the code.

I believe it would be a little too complex for the compiler writers. Also, when you reference the intermediate strings inside the loop for anything besides the concatenation (for example, passing them to some other method), this optimization would not be possible.
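For example (Log is a hypothetical method standing in for "some other method"), every intermediate value of s is observed here, so a rewrite that only materializes the string after the loop would change the program's behavior:
string s = "";
for (int i = 0; i < 100; i++)
{
    s = string.Concat(s, i.ToString());
    Log(s); // each intermediate string is used, so deferring
            // ToString() to after the loop is no longer equivalent
}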

Probably because it's complicated to match such a pattern in the code, and if the compiler can't make the match for some reason, the performance of the code is suddenly terrible. Optimising code like that would also encourage writing code like that, which would further increase the negative impact in the cases where the compiler can no longer do the optimisation.
For concatenating a known set of strings, StringBuilder is not faster than String.Concat.

A String is an immutable type, hence repeatedly concatenating strings is slower than using StringBuilder.Append.
Edit: To clarify my point a bit more: when you ask why String.Concat is not optimized to StringBuilder.Append, remember that the StringBuilder class has completely different semantics from the immutable String type. Why should you expect the compiler to optimize one into the other when they are clearly two different things? Furthermore, a StringBuilder is a mutable type that can change its length dynamically; why should a compiler turn an immutable type into a mutable one? That design and those semantics are ingrained in the ECMA spec for the .NET Framework, regardless of the language.
It's a bit like asking the compiler (and perhaps expecting too much of it) to take a char and optimize it into an int because int works on 32 bits instead of 16 and might be deemed faster!

Related

Does the C# compiler inline lambda methods? [duplicate]

Do simple lambda expressions get inlined?
I have a tendency (thanks to F# and other functional forays) to encapsulate repeated code present within a single function into a lambda, and call it instead. I'm curious if I'm incurring a run-time overhead as a result:
var foo = a + b;
var bar = a + b;
vs
Func<T> op = () => a + b;
var foo = op();
var bar = op();
Which one costs more to run?
To answer the performance question: run it a billion times both ways. Measure the cost of each. Then you'll know. We have no idea what hardware you're using, what "noise" is present in your relevant scenarios, or what you consider to be an important performance metric. You're the only person who knows those things, so you're the only person who can answer the question.
To answer your codegen question: Jared is correct but the answer could be expanded upon.
First off, the C# compiler never does inlining of any code. The jit compiler does do inlining of code, but the fact that the C# compiler generates lambdas as delegate instances means that it is unlikely that the jitter can reasonably inline this code. (It is of course possible for the jitter to do this sophisticated analysis to determine that the same code is always in the delegate, but I do not believe that in practice those algorithms have been implemented.)
If you want the code to be inlined then you should write it in line. If you don't want to write it in line but you still want it inlined then you should write it as a static method and hope the jitter inlines it.
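A sketch of the contrast (Add is a name I made up; whether the jitter actually inlines it is its call, not ours):
using System;

class InlineSketch
{
    // A small static method like this is a good candidate for jit inlining.
    static int Add(int x, int y) { return x + y; }

    static void Main()
    {
        // The lambda is compiled to a delegate instance; invoking it is an
        // indirect call that the jitter will generally not inline.
        Func<int, int, int> addDelegate = (x, y) => x + y;
        int viaDelegate = addDelegate(2, 3);

        // A direct call site the jitter can see through.
        int viaMethod = Add(2, 3);

        Console.WriteLine(viaDelegate + " " + viaMethod);
    }
}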
But regardless, this sounds like premature optimization. Write the code the way you want to write the code, and then analyze its performance, and then rewrite the slow stuff.
No. Lambda functions are not inlined but instead are stored as delegates under the hood and incur the same cost of execution as other delegates.

How long does variable assignment take?

I've frequently come across code that makes liberal use of variables like var self = this; so the code reads more nicely. While I don't think assignments like these are ever going to be significant in any piece of code, I've always wondered how long an assignment like that takes.
With that said: how long does it take, assuming it isn't optimized away? How do the times compare between different languages - e.g. C#, Java, and C++? Between common value types (including pointers)? On 32- and 64-bit architectures?
EDIT: Erased the part about "noticeable difference". I meant that part as a side-question, but many people have seen that and started downvoting me for premature optimization (in spite of me highlighting the bottleneck part in bold).
The code:
var self = this;
isn't creating a new instance of the object 'this'; it is just copying a reference to it. At the machine level there is only one pointer, since the C# compiler optimizes these kinds of references away. So the "how long it takes" is effectively zero.
So why do it? Because it makes the code easier to read.
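A quick way to convince yourself that only a reference is copied (a minimal sketch):
using System;

class Widget
{
    public void Demo()
    {
        var self = this; // copies a reference, not the object
        Console.WriteLine(object.ReferenceEquals(self, this)); // True: same instance
    }

    static void Main()
    {
        new Widget().Demo();
    }
}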

Constant abuse?

I have run across a bunch of code in a few C# projects that have the following constants:
const int ZERO_RECORDS = 0;
const int FIRST_ROW = 0;
const int DEFAULT_INDEX = 0;
const int STRINGS_ARE_EQUAL = 0;
Has anyone ever seen anything like this? Is there any way to rationalize using constants to represent language constructs? E.g., C#'s first index in an array is at position 0. I would think that if a developer needs a constant to tell them that the language is 0-based, there is a bigger issue at hand.
The most common usage of these constants is in handling Data Tables or within 'for' loops.
Am I out of place thinking these are a code smell? I feel that these aren't a whole lot better than:
const int ZERO = 0;
const string A = "A";
Am I out of place thinking these are a code smell? I feel that these aren't a whole lot better than:
Compare the following:
if(str1.CompareTo(str2) == STRINGS_ARE_EQUAL) ...
with
if(str1.CompareTo(str2) == ZERO) ...
if(str1.CompareTo(str2) == 0) ...
Which one makes more immediate sense?
Abuse, IMHO. "Zero" is just one of the basics.
Although STRINGS_ARE_EQUAL could be easy to read, why not use .Equals?
Accepted limited use of magic numbers?
That's definitely a code smell.
The intent may have been to 'add readability' to the code, however things like that actually decrease the readability of code in my opinion.
Some people consider any raw number within a program to be a 'magic number'. I have seen coding standards that basically said that you couldn't just write an integer into a program, it had to be a const int.
Am I out of place thinking these are a code smell? I feel that these aren't a whole lot better than:
const int ZERO = 0;
const int A = 'A';
Probably a bit of a smell, but definitely better than ZERO = 0 and A = 'A'. In the first case they're defining logical constants, i.e. attaching a concrete value to some abstract idea (string equality).
In your ZERO/A example, you're defining literal constants - the names just restate the values themselves. In that case I would think an enumeration is preferred, since such values are rarely singular.
That is definitely bad coding.
I say constants should be used only where things could possibly change later. For instance, I have a lot of "configuration" options like SESSION_TIMEOUT defined: they should stay the same, but maybe they'll be tweaked later on down the road. I do not think ZERO can ever be tweaked down the road.
Also, zero should not be counted among the magic numbers.
I'm a bit strange on that belief, though, because I would also say something like this is going too far:
// input is FIELD_xxx, where xxx is a number
input.Substring(LENGTH_OF_FIELD_NAME); // cut off the "FIELD_" prefix to give us the number
You should have a look at some of the things at thedailywtf: One2Pt20462262185th and Enterprise SQL.
I think sometimes people blindly follow coding standards which say "Don't use hardcoded values; define them as constants so that it's easier to manage the code when it needs to be updated" - which is fair enough for stuff like:
const int MAX_NUMBER_OF_ELEMENTS_I_WILL_ALLOW = 100;
But does not make sense for:
if(str1.CompareTo(str2) == STRINGS_ARE_EQUAL)
Because every time I see this code I need to look up what STRINGS_ARE_EQUAL is defined as, and then check against the docs that the definition is correct.
Instead if I see:
if(str1.CompareTo(str2) == 0)
I skip step 1 (looking up what STRINGS_ARE_EQUAL is defined as) and can check the specs directly for what the value 0 means.
You would rightly feel like replacing this with Equals(), and using CompareTo() in cases where you are interested in more than just the one case, e.g.:
switch (bla.CompareTo(bla1))
{
    case IS_EQUAL:
        // ...
        break;
    case IS_SMALLER:
        // ...
        break;
    case IS_BIGGER:
        // ...
        break;
    default:
        break;
}
using if/else statements if appropriate (no idea what CompareTo() returns ...)
I would still check if you defined the values correctly according to specs.
This is of course different if the specs define something like a ComparisonClass::StringsAreEqual value (I've just made that one up); then you would use the appropriate member rather than 0.
So it depends. When you specifically need to access the first element of an array, arr[0] is better than arr[FIRST_ELEMENT], because I will still go and check what you have defined FIRST_ELEMENT as - I will not trust you, and it might be something other than 0. For example, maybe your 0 element is a dud and the real first element is stored at 1 - who knows.
I'd go for code smell. If these kinds of constants are necessary, put them in an enum:
enum StringEquality
{
Equal,
NotEqual
}
(However I suspect STRINGS_ARE_EQUAL is what gets returned by string.Compare, so hacking it to return an enum might be even more verbose.)
Edit: Also SHOUTING_CASE isn't a particularly .NET-style naming convention.
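If you did want the enum, one way to bridge string.Compare's int result is a small wrapper (a sketch; all the names here are mine):
enum StringEquality { Equal, NotEqual }

static class StringEqualityHelper
{
    // Wraps the int result of string.Compare in the enum; purely illustrative.
    public static StringEquality Compare(string x, string y)
    {
        return string.Compare(x, y) == 0 ? StringEquality.Equal
                                         : StringEquality.NotEqual;
    }
}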
I don't know if I would call them smells, but they do seem redundant. Though DEFAULT_INDEX could actually be useful.
The point is to avoid magic numbers and zeros aren't really magical.
Is this code something in your office or something you downloaded?
If it's in the office, I think it's a problem with management if people are randomly scattering constants around. Globally, there shouldn't be any constants unless everyone has a clear idea of, or an agreement on, what those constants are used for.
In C#, ideally you'd want to create a class that holds constants used globally by every other class. For example:
class MathConstants
{
    public const int ZERO = 0;
}
Then in later classes, something like:
....
if (something == MathConstants.ZERO)
...
At least that's how I see it. This way everyone can understand what those constants are without even reading anything else. It would reduce confusion.
There are generally four reasons I can think of for using a constant:
As a substitute for a value that could reasonably change in the future (e.g., IdColumnNumber = 1).
As a label for a value that may not be easy to understand or meaningful on its own (e.g., FirstAsciiLetter = 65).
As a shorter and less error-prone way of typing a lengthy or hard-to-type value (e.g., LongSongTitle = "Supercalifragilisticexpialidocious").
As a memory aid for a value that is hard to remember (e.g., PI = 3.14159265).
For your particular examples, here's how I'd judge each example:
const int ZERO_RECORDS = 0;
// almost definitely a code smell
const int FIRST_ROW = 0;
// first row could be 1 or 0, so this potentially fits reason #2,
// however, doesn't make much sense for standard .NET collections
// because they are always zero-based
const int DEFAULT_INDEX = 0;
// this fits reason #2, possibly #1
const int STRINGS_ARE_EQUAL = 0;
// this very nicely fits reason #2, possibly #4
// (at least for anyone not intimately familiar with string.CompareTo())
So, I would say that, no, these are not worse than Zero = 0 or A = "A".
If the zero indicates something other than zero (in this case STRINGS_ARE_EQUAL), then it IS magical. Creating a constant for it is both acceptable and makes the code more readable.
Creating a constant called ZERO is pointless and a waste of finger energy!
Smells a bit, but I could see cases where this would make sense, especially if you have programmers switching from language to language all the time.
For instance, MATLAB is one-indexed, so I could imagine someone getting fed up with making off-by-one mistakes whenever they switch languages, and defining DEFAULT_INDEX in both C++ and MATLAB programs to abstract the difference. Not necessarily elegant, but if that's what it takes...
Right you are to question this smell, young code warrior. However, these named constants derive from coding practices much older than the dawn of Visual Studio. They probably are redundant, but you could do worse than to understand the origin of the convention. Think NASA computers, way back when...
You might see something like this in a cross-platform situation, where you would include the file with the set of constants appropriate to the platform - but probably not with these actual examples. This looks like a COBOL coder trying to make his C# read more like English (no offence intended to COBOL coders).
It's all right to use constants to represent abstract values; it's quite another thing to use them to represent constructs of the language itself.
const int FIRST_ROW = 0 doesn't make sense.
const int MINIMUM_WIDGET_COUNT = 0 makes more sense.
The presumption that you should follow a coding standard makes sense. (That is, coding standards are presumptively correct within an organization.) Slavishly following it when the presumption isn't met doesn't make sense.
So I agree with the earlier posters that some of the smelly constants probably resulted from following a coding standard ("no magic numbers") to the letter without exception. That's the problem here.

Is it costly to do array.length or list.count in a loop

I know that in JavaScript, writing a for loop like this: for (var i = 0; i < arr.length; i++) is costly, because it reads the array length on every iteration.
Is this costly in C# for lists and arrays as well, or is it optimized away at compile time? And what about other languages such as Java - how is it handled there?
It is not costly in C#. For one thing, there is no "calculation": querying the length is basically an elementary operation thanks to inlining. And secondly, because (according to its developers) the compiler recognizes this pattern of access and will in fact optimize away (redundant) bounds checks on array element accesses.
And by the way, I believe that something similar is true for modern JavaScript virtual machines, and if it isn't already, it will be very soon since this is a trivial optimization.
All .NET arrays have a field containing the length of the array, so the length is not computed at usage time but at creation time.
The .NET virtual machine is very good at eliminating bounds checks whenever possible, and this is one of those cases: the bounds check is moved outside the loop (in most situations; where it isn't, the overhead is just two instructions).
Edit:
Array Bounds Check Elimination
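As a sketch of what that means in practice (a general observation, not a claim from the linked article): the idiomatic form below is the pattern the JIT recognizes, so hoisting the length into a local buys you nothing for arrays and may even hide the pattern from it.
int[] arr = { 1, 2, 3, 4, 5 };

// Idiomatic form: the JIT recognizes i < arr.Length and can
// eliminate the per-access bounds check inside the loop.
for (int i = 0; i < arr.Length; i++)
    Console.WriteLine(arr[i]);

// Manually hoisted length: no faster for arrays, and it may
// obscure the pattern the JIT's bounds-check elimination looks for.
int len = arr.Length;
for (int i = 0; i < len; i++)
    Console.WriteLine(arr[i]);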
In almost any language, the answer will be "it depends".
Mostly, it depends on whether the compiler is clever enough to be able to tell whether the length of the list or array might change whilst you're in the loop.
That's unlikely to be defined by the language specification, though.
So it's probably safest to assume the compiler may not be able to figure that out. If you really, truly believe the length of the object won't change, feel free to read the length once and use that in your loop-control constructs.
But beware of other threads...
I believe that if you use the LINQ Count() extension method, it may re-enumerate the source every time it's called.
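For what it's worth, Enumerable.Count() takes an O(1) shortcut when the source implements ICollection<T>; for anything else it walks the whole sequence, so if you need the count repeatedly it's worth caching it (a sketch):
using System;
using System.Collections.Generic;
using System.Linq;

class CountSketch
{
    static void Main()
    {
        // A lazy query: Count() must enumerate it from scratch on every call.
        IEnumerable<int> evens = Enumerable.Range(0, 1000).Where(n => n % 2 == 0);

        int count = evens.Count(); // enumerate once and cache the result

        for (int i = 0; i < count; i++) // instead of calling Count() per iteration
        {
            // ...
        }
        Console.WriteLine(count);
    }
}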
If it's anything like Java, it should be an O(1) operation.
I found the following link helpful: http://www.devguru.com/Technologies/Ecmascript/Quickref/array.html
It will also depend on whether that getter is doing a calculation, or accessing a known value.
