StringBuilder: how to get the final String? - c#

Someone told me that it's faster to concatenate strings with StringBuilder. I have changed my code but I do not see any Properties or Methods to get the final build string.
How can I get the string?

You can use .ToString() to get the String from the StringBuilder.

Once you have completed the processing using the StringBuilder, use the ToString method to return the final result.
From MSDN:
using System;
using System.Text;
public sealed class App
{
static void Main()
{
// Create a StringBuilder that expects to hold 50 characters.
// Initialize the StringBuilder with "ABC".
StringBuilder sb = new StringBuilder("ABC", 50);
// Append three characters (D, E, and F) to the end of the StringBuilder.
sb.Append(new char[] { 'D', 'E', 'F' });
// Append a format string to the end of the StringBuilder.
sb.AppendFormat("GHI{0}{1}", 'J', 'k');
// Display the number of characters in the StringBuilder and its string.
Console.WriteLine("{0} chars: {1}", sb.Length, sb.ToString());
// Insert a string at the beginning of the StringBuilder.
sb.Insert(0, "Alphabet: ");
// Replace all lowercase k's with uppercase K's.
sb.Replace('k', 'K');
// Display the number of characters in the StringBuilder and its string.
Console.WriteLine("{0} chars: {1}", sb.Length, sb.ToString());
}
}
// This code produces the following output.
//
// 11 chars: ABCDEFGHIJk
// 21 chars: Alphabet: ABCDEFGHIJK

When you say "it's faster to concatenate strings with a StringBuilder", this is only true if you are repeatedly (I repeat - repeatedly) concatenating to the same object.
If you're just concatenating 2 strings and doing something with the result immediately as a string, there's no point to using StringBuilder.
I just stumbled on Jon Skeet's nice write up of this:
https://jonskeet.uk/csharp/stringbuilder.html
If you are using StringBuilder, then to get the resulting string, it's just a matter of calling ToString() (unsurprisingly).

I would just like to throw out that is may not necessarily faster, it will definitely have a better memory footprint. This is because string are immutable in .NET and every time you change a string you have created a new one.

About it being faster/better memory:
I looked into this issue with Java, I assume .NET would be as smart about it.
The implementation for String is pretty impressive.
The String object tracks "length" and "shared" (independent of the length of the array that holds the string)
So something like
String a = "abc" + "def" + "ghi";
can be implemented (by the compiler/runtime) as:
- Extend the array holding "abc" by 6 additional spaces.
- Copy def in right after abc
- copy ghi in after def.
- give a pointer to the "abc" string to a
- leave abc's length at 3, set a's length to 9
- set the shared flag in both.
Since most strings are short-lived, this makes for some VERY efficient code in many cases.
The case where it's absolutely NOT efficient is when you are adding to a string within a loop, or when your code is like this:
a = "abc";
a = a + "def";
a += "ghi";
In this case, you are much better off using a StringBuilder construct.
My point is that you should be careful whenever you optimize, unless you are ABSOLUTELY sure that you know what you are doing, AND you are absolutely sure it's necessary, AND you test to ensure the optimized code makes a use case pass, just code it in the most readable way possible and don't try to out-think the compiler.
I wasted 3 days messing with strings, caching/reusing string-builders and testing speed before I looked at the string source code and figured out that the compiler was already doing it better than I possibly could for my use case. Then I had to explain how I didn't REALLY know what I was doing, I only thought I did...

It's not faster to concat - As smaclell pointed out, the issue is the immutable string forcing an extra allocation and recopying of existing data.
"a"+"b"+"c" is no faster to do with string builder, but repeated concats with an intermediate string gets faster and faster as the # of concat's gets larger like:
x = "a"; x+="b"; x+="c"; ...

Related

Optimizing string manipulation

It is 2019 and we have a banking project which uses mainframe as data store and transactions.
We are using DTO's (Commarea, plain c# class) that is converted to plain string (this is how mainframe works) then sent to Mainframe.
While converting a class to string representation we use several string operations such as substring, pad left, pad right, trim etc.
As you can imagine, this causes several string allocations and hence garbage collection. It is usually at generation 0 but still.
Especially types like Decimal which is a Pack type in mainframe that fits into 8 bytes creates several strings.
I tried using ReadonlySpan<char> for example for substring. See example.
However, there are operations like PadRight, PadLeft which is not avaiable, because it is a read only span.
Update:
To clarify a part of conversion happens as follows:
val.Trim().Substring(5).PadRight(10);
I know that this creates 3 string. I know strings are immutable. My question is about doing the above operation with ReadonlySpan or Memory.
I can not use ReadonlySpan only for substring because as soon as I call ToString method I m losing the benefits.
I have to call ToString all the way at the end.
Is there another construct that supports other operations behind substring, that I can actually add remove data to the memory?
Thanks.
Using ReadOnlySpan can help reduce the number of string allocations in your code, but it won't eliminate them completely. This is because ReadOnlySpan is a read-only view of a sequence of characters, so you cannot modify the underlying data using a ReadOnlySpan.
To avoid unnecessary string allocations, you can use the string.AsSpan() method to get a ReadOnlySpan view of a string, and then use the Span.Slice() method to get substrings without allocating new strings. For example, you could use the following code to get a substring of a string without allocating a new string:
string val = "Hello world";
ReadOnlySpan<char> span = val.AsSpan();
ReadOnlySpan<char> substring = span.Slice(5);
However, as mentioned earlier, you cannot use ReadOnlySpan to modify the underlying data, so you will still need to allocate new strings for operations like PadRight and PadLeft. To avoid these allocations, you can use a StringBuilder to build up the string piece by piece, and then call ToString() on the StringBuilder when you're done. This will allow you to perform string operations without allocating new strings for each operation.
In summary, using ReadOnlySpan can help reduce the number of string allocations in your code, but it won't eliminate them completely. To avoid allocating new strings for each string operation, you can use a StringBuilder to build up the final string piece by piece.
string val = "Hello world";
StringBuilder builder = new StringBuilder(val.Length);
// Trim the string
builder.Append(val.Trim());
// Get a substring starting at the 5th character
builder.Append(val, 5, val.Length - 5);
// Pad the string with spaces to the right, to make it 10 characters long
builder.PadRight(10, ' ');
// Convert the final string to a regular string
string result = builder.ToString();

Parsing a string to get a specific value

I'm new to C#. I'm parsing for a lot number in a 2D barcode. The actual lot number 'A2351' is hidden in this barcode string "+M727PP011/$$3201001A2351S". I would like to break this barcode up in separate string blocks but the delimiters are not consistent.
The letter prefix in front of the 4 digit lot number can be a 'A', 'P', or a 'D' There is a single letter following the lot number that can be ignored.
string Delimiter = "/$$3";
//barcode format:M###PP###/$$3 ddmmyy lotnumprefix 'A' followed by lotNum
string lotNum= "+M727PP011/$$3201001A2351S";
string[] split = lotNum.Split(new[] {Delimiter}, StringSplitOptions.None);
How do I extract the lot number after the date?
Based on your initial example and then the subsequent edit in which you showed how you are solving this, it sounds like the lot number is always in the same place. It would be cleaner (and more in line with standard C# code) to use a single call to string.Substring(int,int) rather than the two lines you are using which also require pulling in the VB library. You just need to call Substring and give it the starting index and the length.
So this code:
string lotNum = Strings.Right(barcode, 6);
lotNum = lotNum.Remove((lotNum.Length - 1), 1);
Can be done with this single substring call:
string lotNum = barcode.Substring(barcode.Length - 6, 5);
Edit
Just further clarification on why it might be better to use the call to Substring. In C# string objects are immutable. That means that when you make the call to Strings.Right you are getting back a new string object. When you then call lotNum.Remove you do not "remove" a character from the existing string, a new string is allocated with the character(s) removed and is returned to you. So with your code there are two new string allocations when trying to extract the lot number. When you make the call to Substring you will get back a new string, but instead of getting a new string that you immediately then modify and get a second new string, you will only need to allocate one new string to extract the lot number. In the example you have given there probably would not be any noticeable performance/memory issue, but it is something that could potentially lead to trouble if this code was in a tight loop or something like that.
If you're just trying to get the lot number, it's really dependent on the format of the input string (is it a consistent length, are there any reliable prefixes/suffixes relative to the data you're trying to parse that you can reference from, etc). It looks like your data is definable by its static position in the string, so it looks like you could use the substring
(with an index of 20?) method to accomplish what you want.

Declaring long strings that use string interpolation in C# 6

I usually wrap long strings by concatenating them:
Log.Debug("I am a long string. So long that I must " +
"be on multiple lines to be feasible.");
This is perfectly efficient, since the compiler handles concatenation of string literals. I also consider it the cleanest way to handle this problem (the options are weighed here).
This approach worked well with String.Format:
Log.Debug(String.Format("Must resize {0} x {1} image " +
"to {2} x {3} for reasons.", image.Width, image.Height,
resizedImage.Width, resizedImage.Height));
However, I now wish to never use String.Format again in these situations, since C# 6's string interpolation is much more readable. My concern is that I no longer have an efficient, yet clean way to format long strings.
My question is if the compiler can somehow optimize something like
Log.Debug($"Must resize {image.Width} x {image.Height} image " +
$"to {resizedImage.Width} x {resizedImage.Height} for reasons.");
into the above String.Format equivalent or if there's an alternative approach that I can use that won't be less efficient (due to the unnecessary concatenation) while also keeping my code cleanly structured (as per the points raised in the link above).
This program:
var name = "Bobby Tables";
var age = 8;
String msg = $"I'm {name} and" +
$" I'm {age} years old";
is compiled as if you had written:
var name = "Bobby Tables";
var age = 8;
String msg = String.Concat(String.Format("I'm {0} and", name),
String.Format(" I'm {0} years old", age));
You see the difficulty in getting rid of the Concat - the compiler has re-written our interpolation literals to use the indexed formatters that String.Format expects, but each string has to number its parameters from 0. Naively concatenating them would cause them both to insert name. To get this to work out correctly, there would have to be state maintained between invocations of the $ parser so that the second string is reformatted as " I'm {1} years old". Alternatively, the compiler could try to apply the same kind of analysis it does for concatenation of string literals. I think this would be a legal optimization even though string interpolation can have side effects, but I wouldn't be surprised if it turned out there was a corner case under which interpolated string concatenation changed program behavior. Neither sounds impossible, especially given the logic is already there to detect a similar condition for string literals, but I can see why this feature didn't make it into the first release.
I would write the code in the way that you feel is cleanest and most readable, and not worry about micro-inefficiencies unless they prove to be a problem. The old saying about code being primarily for humans to understand holds here.
Maybe it would be not as readable as with + but by all means, it is possible. You just have to break line between { and }:
Log.Debug($#"Must resize {image.Width} x {image.Height} image to {
resizedImage.Width} x {resizedImage.Height} for reasons.");
SO's colouring script does not handle this syntax too well but C# compiler does ;-)
In the specialized case of using this string in HTML (or parsing with whatever parser where multiple whitespaces does not matter), I could recommend you to use #$"" strings (verbatim interpolated string) eg.:
$#"some veeeeeeeeeeery long string {foo}
whatever {bar}"
In c# 6.0:
var planetName = "Bob";
var myName = "Ford";
var formattedStr = $"Hello planet {planetName}, my name is {myName}!";
// formattedStr should be "Hello planet Bob, my name is Ford!"
Then concatenate with stringbuilder:
StringBuilder stringBuilder = new StringBuilder();
stringBuilder.Append(formattedStr);
// Then add the strings you need
Append more strings to stringbuilder.....

Most efficient way of adding/removing a character to beginning of string?

I was doing a small 'scalable' C# MVC project, with quite a bit of read/write to a database.
From this, I would need to add/remove the first letter of the input string.
'Removing' the first character is quite easy (using a Substring method) - using something like:
String test = "HHello world";
test = test.Substring(1,test.Length-1);
'Adding' a character efficiently seems to be messy/awkward:
String test = "ello World";
test = "H" + test;
Seeing as this will be done for a lot of records, would this be be the most efficient way of doing these operations?
I am also testing if a string starts with the letter 'T' by using, and adding 'T' if it doesn't by:
String test = "Hello World";
if(test[0]!='T')
{
test = "T" + test;
}
and would like to know if this would be suitable for this
If you have several records and to each of the several records field you need to append a character at the beginning, you can use String.Insert with an index of 0 http://msdn.microsoft.com/it-it/library/system.string.insert(v=vs.110).aspx
string yourString = yourString.Insert( 0, "C" );
This will pretty much do the same of what you wrote in your original post, but since it seems you prefer to use a Method and not an operator...
If you have to append a character several times, to a single string, then you're better using a StringBuilder http://msdn.microsoft.com/it-it/library/system.text.stringbuilder(v=vs.110).aspx
Both are equally efficient I think since both require a new string to be initialized, since string is immutable.
When doing this on the same string multiple times, a StringBuilder might come in handy when adding. That will increase performance over adding.
You could also opt to move this operation to the database side if possible. That might increase performance too.
For removing I would use the remove command as this doesn't require to know the length of the string:
test = test.Remove(0, 1);
You could also treat the string as an array for the Add and use
test = test.Insert(0, "H");
If you are always removing and then adding a character you can treat the string as an array again and just replace the character.
test = (test.ToCharArray()[0] = 'H').ToString();
When doing lots of operations to the same string I would use a StringBuilder though, more expensive to create but faster operations on the string.

Which Regular Expressions and toUpper combination is faster?

I have two text boxes, one for the input and another for the output. I need to filter only Hexadecimals characters from input and output it in uppercase. I have checked that using Regular Expressions (Regex) is much faster than using loop.
My current code to uppercase first then filter the Hex digits as follow:
string strOut = Regex.Replace(inputTextBox.Text.ToUpper(), "[^0-9^A-F]", "");
outputTextBox.Text = strOut;
An alternatively:
string strOut = Regex.Replace(inputTextBox.Text, "[^0-9^A-F^a-f]", "");
outputTextBox.Text = strOut.ToUpper();
The input may contain up to 32k characters, therefore speed is important here. I have used TimeSpan to measure but the results are not consistent.
My question is: which code has better speed performance and why?
This is definitely a case of premature optimization: 32K characters is not a big deal for finely tuned regex engines running on modern computers, so this optimization task is mostly theoretical.
Before discussing the performance, it's worth pointing out that the expressions are probably not doing what you want, because they allow ^ characters into the output. You need to use [^0-9A-F] and [^0-9A-Fa-f] instead.
The speed of the two regexes will be identical, because the number of characters in a character class hardly makes a difference. However, the second combination ToUpper call will be called on a potentially shorter string, because all invalid characters will be removed. Therefore, the second option is potentially slightly faster.
However, if you must optimize this to the last CPU cycle, you can rewrite this without regular expressions to avoid a memory allocation in the ToUpper: walk through the input string in a loop, and add all valid characters to StringBuilder as you go. When you see a lowercase character, convert it to upper case.
It's simple to test:
using System;
using System.Diagnostics;
using System.Linq;
using System.Text.RegularExpressions;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
string letters = "abcdefghijklmnopqrstuvxyzABCDEFGHIJKLMNOPQRSTUVXYZ";
Random random = new Random();
string[] strings = Enumerable.Range(0, 5000).Select(i1 => string.Join("", Enumerable.Range(0,32000).Select(i2 => letters[random.Next(0, letters.Length - 1)]))).ToArray();
Stopwatch stopwatchA = new Stopwatch();
stopwatchA.Start();
foreach (string s in strings)
Regex.Replace(s.ToUpper(), "[^0-9^A-F]", "");
stopwatchA.Stop();
Stopwatch stopwatchB = new Stopwatch();
stopwatchB.Start();
foreach (string s in strings)
Regex.Replace(s, "[^0-9^A-F^a-f]", "").ToUpper();
stopwatchB.Stop();
Debug.WriteLine("stopwatchA: {0}", stopwatchA.Elapsed);
Debug.WriteLine("stopwatchB: {0}", stopwatchB.Elapsed);
}
}
}
Run 1:
stopwatchA: 00:00:39.6552012
stopwatchB: 00:00:40.6757048
Run 2:
stopwatchA: 00:00:39.7022437
stopwatchB: 00:00:41.3477625
In those to runs, the first approach is faster.
On the theoretical size, string.ToUpper() can potentially allocate a new string (remember that .NET strings are semantically immutable), but the regular expression on the other hand conserves memory, i.e. it should be faster in the general case (for large strings). Also the input string will be looped through twice if using the toUpper() call.

Categories