Optimizing string manipulation - c#

It is 2019 and we have a banking project that uses a mainframe as its data store and for transactions.
We use DTOs (Commarea, plain C# classes) that are converted to a plain string (this is how the mainframe works) and then sent to the mainframe.
While converting a class to its string representation we use several string operations such as Substring, PadLeft, PadRight, Trim, etc.
As you can imagine, this causes several string allocations and hence garbage collection. It is usually in generation 0, but still.
Types like Decimal especially, which is a Pack type on the mainframe that fits into 8 bytes, create several strings.
I tried using ReadOnlySpan<char>, for example for Substring. See example.
However, operations like PadRight and PadLeft are not available, because it is a read-only span.
Update:
To clarify a part of conversion happens as follows:
val.Trim().Substring(5).PadRight(10);
I know that this creates 3 strings. I know strings are immutable. My question is about doing the above operation with ReadOnlySpan or Memory.
I cannot use ReadOnlySpan only for the substring, because as soon as I call ToString I lose the benefits.
I have to call ToString only once, at the very end.
Is there another construct that supports operations besides substring, so that I can actually add data to and remove data from the memory?
Thanks.

Using ReadOnlySpan can help reduce the number of string allocations in your code, but it won't eliminate them completely. This is because ReadOnlySpan is a read-only view of a sequence of characters, so you cannot modify the underlying data using a ReadOnlySpan.
To avoid unnecessary string allocations, you can use the AsSpan() extension method to get a ReadOnlySpan<char> view of a string, and then use ReadOnlySpan<char>.Slice() to get substrings without allocating new strings. For example, you could use the following code to get a substring of a string without allocating a new string:
string val = "Hello world";
ReadOnlySpan<char> span = val.AsSpan();
ReadOnlySpan<char> substring = span.Slice(5);
However, as mentioned earlier, you cannot use ReadOnlySpan<char> to modify the underlying data, and it has no PadRight or PadLeft methods, so padding has to be written into a writable buffer. One option is a StringBuilder: append the pieces you need and call ToString() once when you're done. Note that StringBuilder has no PadRight method either; you pad by appending the required number of spaces with Append(char, int). This way the intermediate operations do not allocate strings, and only the final ToString() does.
In summary, using ReadOnlySpan<char> can help reduce the number of string allocations in your code, but it won't eliminate them completely. To avoid allocating a new string for each string operation, you can slice with spans and use a StringBuilder to build up the final string piece by piece.
string val = "Hello world";
StringBuilder builder = new StringBuilder(10);
// Trim and take everything from index 5 as a span; no intermediate strings are created
ReadOnlySpan<char> slice = val.AsSpan().Trim().Slice(5);
// Append the slice (StringBuilder.Append(ReadOnlySpan<char>) requires .NET Core 2.1 or later)
builder.Append(slice);
// Pad with spaces on the right until the builder is 10 characters long
if (builder.Length < 10)
    builder.Append(' ', 10 - builder.Length);
// Convert the builder to a regular string; this is the only string allocation
string result = builder.ToString();
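If even the StringBuilder is more than you want, the same Trim/Substring/PadRight chain can probably be done in a scratch buffer with a single string allocation at the end. The following is only a sketch, assuming .NET Core 2.1 or later (for the span overloads); the helper name TrimSliceAndPadRight and the 256-character stack limit are made up for illustration and are not from the question.

```csharp
using System;

public static class FixedWidth
{
    // Hypothetical helper: same result as val.Trim().Substring(start).PadRight(width),
    // but with a single string allocation at the end.
    public static string TrimSliceAndPadRight(string val, int start, int width)
    {
        // Trim and slice as spans; neither call allocates a string.
        ReadOnlySpan<char> slice = val.AsSpan().Trim().Slice(start);

        // PadRight never truncates, so the result is at least as long as the slice.
        int length = Math.Max(width, slice.Length);

        // Small results use the stack; larger ones fall back to a heap array (assumed limit).
        Span<char> buffer = length <= 256 ? stackalloc char[length] : new char[length];

        slice.CopyTo(buffer);
        buffer.Slice(slice.Length).Fill(' ');

        // The only string allocation in the whole chain.
        return new string(buffer);
    }
}
```

Calling FixedWidth.TrimSliceAndPadRight(val, 5, 10) is meant to stand in for val.Trim().Substring(5).PadRight(10). If you convert many fields per DTO, writing every field into one shared buffer and creating a single string for the whole record may save even more, but that depends on your Commarea layout.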

Related

Parsing a string to get a specific value

I'm new to C#. I'm parsing for a lot number in a 2D barcode. The actual lot number 'A2351' is hidden in this barcode string "+M727PP011/$$3201001A2351S". I would like to break this barcode up into separate string blocks, but the delimiters are not consistent.
The letter prefix in front of the 4-digit lot number can be an 'A', a 'P', or a 'D'. There is a single letter following the lot number that can be ignored.
string Delimiter = "/$$3";
//barcode format:M###PP###/$$3 ddmmyy lotnumprefix 'A' followed by lotNum
string lotNum= "+M727PP011/$$3201001A2351S";
string[] split = lotNum.Split(new[] {Delimiter}, StringSplitOptions.None);
How do I extract the lot number after the date?

Based on your initial example and then the subsequent edit in which you showed how you are solving this, it sounds like the lot number is always in the same place. It would be cleaner (and more in line with standard C# code) to use a single call to string.Substring(int,int) rather than the two lines you are using which also require pulling in the VB library. You just need to call Substring and give it the starting index and the length.
So this code:
string lotNum = Strings.Right(barcode, 6);
lotNum = lotNum.Remove((lotNum.Length - 1), 1);
Can be done with this single substring call:
string lotNum = barcode.Substring(barcode.Length - 6, 5);
Edit
Just further clarification on why it might be better to use the call to Substring. In C# string objects are immutable. That means that when you make the call to Strings.Right you are getting back a new string object. When you then call lotNum.Remove you do not "remove" a character from the existing string, a new string is allocated with the character(s) removed and is returned to you. So with your code there are two new string allocations when trying to extract the lot number. When you make the call to Substring you will get back a new string, but instead of getting a new string that you immediately then modify and get a second new string, you will only need to allocate one new string to extract the lot number. In the example you have given there probably would not be any noticeable performance/memory issue, but it is something that could potentially lead to trouble if this code was in a tight loop or something like that.

If you're just trying to get the lot number, it's really dependent on the format of the input string (is it a consistent length, are there any reliable prefixes/suffixes relative to the data you're trying to parse that you can reference from, etc). It looks like your data is definable by its static position in the string, so it looks like you could use the Substring method (with an index of 20?) to accomplish what you want.
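To make the position-based idea concrete, here is a minimal sketch that locates the "/$$3" delimiter from the question, skips the six-digit date, and takes the five-character lot number. It assumes the date is always six digits and the lot number is always one letter plus four digits; the class and method names are hypothetical.

```csharp
using System;

public static class BarcodeParser
{
    // Hypothetical helper; assumes "<header>/$$3<6-digit date><5-char lot><1 trailing char>".
    public static string ExtractLotNumber(string barcode)
    {
        const string delimiter = "/$$3";
        const int dateLength = 6;
        const int lotLength = 5;

        int delimiterIndex = barcode.IndexOf(delimiter, StringComparison.Ordinal);
        if (delimiterIndex < 0)
            throw new FormatException("Delimiter not found in barcode.");

        // The lot number starts right after the delimiter and the date.
        int lotStart = delimiterIndex + delimiter.Length + dateLength;
        if (lotStart + lotLength > barcode.Length)
            throw new FormatException("Barcode is too short to contain a lot number.");

        return barcode.Substring(lotStart, lotLength);
    }
}
```

For the sample barcode the delimiter is found at index 10, so the lot number starts at index 20, which matches the index guessed above.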

Most efficient way of adding/removing a character to beginning of string?

I was doing a small 'scalable' C# MVC project, with quite a bit of read/write to a database.
From this, I would need to add/remove the first letter of the input string.
'Removing' the first character is quite easy (using a Substring method) - using something like:
String test = "HHello world";
test = test.Substring(1,test.Length-1);
'Adding' a character efficiently seems to be messy/awkward:
String test = "ello World";
test = "H" + test;
Seeing as this will be done for a lot of records, would this be the most efficient way of doing these operations?
I am also testing whether a string starts with the letter 'T', and adding 'T' if it doesn't, with:
String test = "Hello World";
if(test[0]!='T')
{
test = "T" + test;
}
and would like to know whether this is suitable.

If you have several records and need to prepend a character to a field of each one, you can use String.Insert with an index of 0: http://msdn.microsoft.com/it-it/library/system.string.insert(v=vs.110).aspx
yourString = yourString.Insert( 0, "C" );
This will pretty much do the same as what you wrote in your original post, but since it seems you prefer to use a method rather than an operator...
If you have to add a character several times to a single string, then you're better off using a StringBuilder: http://msdn.microsoft.com/it-it/library/system.text.stringbuilder(v=vs.110).aspx

Both are equally efficient, I think, since both require a new string to be created, because string is immutable.
When doing this on the same string multiple times, a StringBuilder might come in handy. That will perform better than repeated concatenation.
You could also opt to move this operation to the database side if possible. That might increase performance too.

For removing I would use the Remove method, as this doesn't require knowing the length of the string:
test = test.Remove(0, 1);
For adding, you could use Insert with an index of 0:
test = test.Insert(0, "H");
If you are always removing and then adding a character, you can copy the string to a char array and just replace the character:
char[] chars = test.ToCharArray(); chars[0] = 'H';
test = new string(chars);
When doing lots of operations on the same string I would use a StringBuilder though; it is more expensive to create but gives faster operations on the string.
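Putting the suggestions from these answers together, a small helper might look like the sketch below. The names EnsurePrefix and RemoveFirst are made up for illustration; the empty-string checks are an addition, since test[0] in the question would throw on an empty string.

```csharp
using System;

public static class PrefixHelper
{
    // Returns s unchanged if it already starts with prefix; otherwise allocates one new string.
    public static string EnsurePrefix(string s, char prefix)
    {
        if (s.Length > 0 && s[0] == prefix)
            return s;
        return prefix + s;
    }

    // Removes the first character, or returns s unchanged if it is empty.
    public static string RemoveFirst(string s)
    {
        return s.Length == 0 ? s : s.Substring(1);
    }
}
```

Each call allocates at most one new string, which is about as good as it gets while the data has to stay a string; for a one-off prepend per record a StringBuilder is unlikely to help.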

Why does `String.Trim()` not trim the object itself?

Not often, but sometimes, I need to use String.Trim() to remove whitespace from a string.
If it has been a while since I last wrote trimming code, I write:
string s = " text ";
s.Trim();
and am surprised that s is not changed. I need to write:
string s = " text ";
s = s.Trim();
Why are some string methods designed in this (not very intuitive) way? Is there something special with strings?

Strings are immutable. Any string operation generates a new string without changing the original string.
From MSDN:
Strings are immutable--the contents of a string object cannot be
changed after the object is created, although the syntax makes it
appear as if you can do this.

s.Trim() creates a new trimmed version of the original string and returns it instead of storing the new version in s. So, what you have to do is to store the trimmed instance in your variable:
s = s.Trim();
This pattern is followed in all the string methods and extension methods.
The fact that string is immutable has less to do with the decision to use this pattern than with how strings are kept in memory. These methods could have been designed to create the new modified string instance in memory and point the variable to the new instance.
It's also good to remember that if you need to make lots of modifications to a string, it's much better to use a StringBuilder, which behaves like a "mutable" string and is much more efficient for this kind of operation.
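A quick way to convince yourself of this is to keep both the original and the returned string and compare them. This is just an illustrative sketch; the class name TrimDemo is made up.

```csharp
using System;

public static class TrimDemo
{
    // Returns the original and the trimmed string so both can be inspected.
    public static (string Original, string Trimmed) Run()
    {
        string s = " text ";

        // Trim() does not touch s; it returns a new string instance.
        string trimmed = s.Trim();

        return (s, trimmed);
    }
}
```

After the call, Original still contains the surrounding spaces and only Trimmed has them removed.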

As it is written in MSDN Library:
A String object is called immutable (read-only), because its value
cannot be modified after it has been created. Methods that appear to
modify a String object actually return a new String object that
contains the modification.
Because strings are immutable, string manipulation routines that
perform repeated additions or deletions to what appears to be a single
string can exact a significant performance penalty.
See this link.

In addition to all the good answers, I also feel that one reason is thread safety.
Let's say
string s = " any text ";
s.Trim();
When you write this, there is nothing stopping another thread from modifying s. If the same string were modified in place, say the other thread removed 'a' from s, then what would the result of s.Trim() be?
But when Trim returns a new string, then even though s is being changed by the other thread, Trim can make a local copy, modify it, and return the modified string.

benefits of using a stringbuilder [duplicate]

Hi,
I'm creating a json string. I have some json encoders that receive objects and return json string. I want to assemble these strings into one long string.
What's the difference between using a StringBuilder and declaring a string and appending strings to it?
Thanks.

When you append to a string, you are creating a new object each time you append, because strings are immutable in .NET.
When using a StringBuilder, you build up the string in a pre-allocated buffer.
That is, for each append to a normal string, you are creating a new object and copying all the characters into it. Because all the little (or big) temporary string objects eventually will need to get garbage-collected, appending a lot of strings together can be a performance problem. Therefore, it is generally a good idea to use a StringBuilder when dynamically appending a lot of strings.
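For the JSON case in the question, that could look roughly like the sketch below. It assumes each encoder already returns a valid JSON fragment and that the fragments should end up in one JSON array; the JsonAssembler name is hypothetical.

```csharp
using System.Collections.Generic;
using System.Text;

public static class JsonAssembler
{
    // Joins already-encoded JSON fragments into a single JSON array string.
    public static string JoinAsArray(IEnumerable<string> fragments)
    {
        var sb = new StringBuilder();
        sb.Append('[');

        bool first = true;
        foreach (string fragment in fragments)
        {
            if (!first)
                sb.Append(',');   // separator between elements
            sb.Append(fragment);  // copied into the builder's buffer, no new string
            first = false;
        }

        sb.Append(']');
        return sb.ToString();     // single final string allocation
    }
}
```

With plain += in the loop, every iteration would copy everything accumulated so far into a new string; with the builder the fragments are copied once each.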

string is immutable and you allocate new memory each time you append strings.
StringBuilder allows you to add characters to an object and when you need to use the string representation, you call ToString() on it.

StringBuilder works like string.Format() and is more efficient than manually appending strings or +ing strings. Using + or manually appending creates multiple string objects in memory.

Copies. The stringbuilder doesn't make new copies of the strings every time; AFAIK Append just copies the bytes into a pre-allocated buffer most of the time rather than reallocating a new string. It is significantly faster! We use it at work all the time.

string.Format is using StringBuilders inside. Using StringBuilder is more optimal because you will work with it exactly as you need without the overhead that Format() needs to interpret all your args in your format string.
Imagine only that string.Format() needs to find all your "{N}" sequences in your formatting string... An extra job, huh?

Strings are immutable in C#. This makes appending strings relatively expensive. StringBuilder solves this problem by creating a buffer; characters are added to the buffer and converted to a string at the end of the operation.
Look here for more info.

In the .NET Framework, every time you add another string to an existing string it creates a completely new instance of a string. (This takes up a lot of memory after a while.)
StringBuilder uses a single instance even when you add more strings to it.
It has everything to do with performance.

String vs StringBuilder will help you understand the difference between String and StringBuilder.

StringBuilder: how to get the final String?

Someone told me that it's faster to concatenate strings with StringBuilder. I have changed my code, but I do not see any properties or methods to get the final built string.
How can I get the string?

You can use .ToString() to get the String from the StringBuilder.

Once you have completed the processing using the StringBuilder, use the ToString method to return the final result.
From MSDN:
using System;
using System.Text;
public sealed class App
{
    static void Main()
    {
        // Create a StringBuilder that expects to hold 50 characters.
        // Initialize the StringBuilder with "ABC".
        StringBuilder sb = new StringBuilder("ABC", 50);
        // Append three characters (D, E, and F) to the end of the StringBuilder.
        sb.Append(new char[] { 'D', 'E', 'F' });
        // Append a format string to the end of the StringBuilder.
        sb.AppendFormat("GHI{0}{1}", 'J', 'k');
        // Display the number of characters in the StringBuilder and its string.
        Console.WriteLine("{0} chars: {1}", sb.Length, sb.ToString());
        // Insert a string at the beginning of the StringBuilder.
        sb.Insert(0, "Alphabet: ");
        // Replace all lowercase k's with uppercase K's.
        sb.Replace('k', 'K');
        // Display the number of characters in the StringBuilder and its string.
        Console.WriteLine("{0} chars: {1}", sb.Length, sb.ToString());
    }
}
// This code produces the following output.
//
// 11 chars: ABCDEFGHIJk
// 21 chars: Alphabet: ABCDEFGHIJK

When you say "it's faster to concatenate strings with a StringBuilder", this is only true if you are repeatedly (I repeat - repeatedly) concatenating to the same object.
If you're just concatenating 2 strings and doing something with the result immediately as a string, there's no point to using StringBuilder.
I just stumbled on Jon Skeet's nice write up of this:
https://jonskeet.uk/csharp/stringbuilder.html
If you are using StringBuilder, then to get the resulting string, it's just a matter of calling ToString() (unsurprisingly).

I would just like to throw out that while it may not necessarily be faster, it will definitely have a better memory footprint. This is because strings are immutable in .NET, and every time you change a string you have created a new one.

About it being faster/better memory:
I looked into this issue with Java, I assume .NET would be as smart about it.
The implementation for String is pretty impressive.
The String object tracks "length" and "shared" (independent of the length of the array that holds the string)
So something like
String a = "abc" + "def" + "ghi";
can be implemented (by the compiler/runtime) as:
- Extend the array holding "abc" by 6 additional spaces.
- Copy def in right after abc
- copy ghi in after def.
- give a pointer to the "abc" string to a
- leave abc's length at 3, set a's length to 9
- set the shared flag in both.
Since most strings are short-lived, this makes for some VERY efficient code in many cases.
The case where it's absolutely NOT efficient is when you are adding to a string within a loop, or when your code is like this:
a = "abc";
a = a + "def";
a += "ghi";
In this case, you are much better off using a StringBuilder construct.
My point is that you should be careful whenever you optimize. Unless you are ABSOLUTELY sure that you know what you are doing, AND you are absolutely sure it's necessary, AND you test to ensure the optimized code makes a use case pass, just code it in the most readable way possible and don't try to out-think the compiler.
I wasted 3 days messing with strings, caching/reusing string-builders and testing speed before I looked at the string source code and figured out that the compiler was already doing it better than I possibly could for my use case. Then I had to explain how I didn't REALLY know what I was doing, I only thought I did...

It's not faster to concat. As smaclell pointed out, the issue is the immutable string forcing an extra allocation and recopying of existing data.
"a"+"b"+"c" is no faster to do with a StringBuilder, but for repeated concats onto an intermediate string the StringBuilder's advantage grows as the number of concats gets larger, like:
x = "a"; x+="b"; x+="c"; ...
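To make the contrast concrete, here is a small sketch of both styles. The class and method names are made up; both methods return the same text, and the difference is only in how much gets copied along the way.

```csharp
using System.Text;

public static class ConcatDemo
{
    // Each += allocates a new string and copies everything accumulated so far.
    public static string WithConcat(int count)
    {
        string x = "";
        for (int i = 0; i < count; i++)
            x += "a";
        return x;
    }

    // Appends go into the builder's buffer; one string is created at the end.
    public static string WithBuilder(int count)
    {
        var sb = new StringBuilder(count);
        for (int i = 0; i < count; i++)
            sb.Append('a');
        return sb.ToString();
    }
}
```

For a handful of concatenations the difference is unlikely to be measurable; it only starts to matter as count grows, so it is worth measuring with your own data before switching.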
