System.String underlying implementation

System.String underlying implementation - c#

I was recently trying to do the following in c#
string str = "u r awesome";
str[0]="i";
And it wouldn't work because apparently str[i] is only a get not a set, so I was wondering what the underlying implementation of string is that would force str[i] to only be a get.
Isn't it just a managed wrapper for a char *? So then why can't I set str[i]?

You can't set characters of a string because the .NET String class is immutable -- that means that its contents cannot be changed after it is created. This allows the same string instance to be used many times safely, without one object worrying that another object is going to stomp on its strings.
If you need a mutable class that lets you manipulate a string of characters, consider using StringBuilder instead.
If you want to compare to C, the String type is like const char * except that you cannot just cast away the constness. StringBuilder is more like a char * (with automatic allocation resizing) and with a method (ToString()) to create a new, independent String instance from its contents.

The answers the others gave concerning immutability are of course correct and are the "actual" cause of the issue your having.
Since you specifically asked about the underlying implementation (and if just out of curiosity), and as a reference to others that might stumble upon this question, here is some more information about that topic from Eric Lippert:
"In the .NET CLR, strings are laid out in memory pretty much the same
way that BSTRs were implemented in OLE Automation: as a word-aligned
memory buffer consisting of a four-byte integer giving the length of
the string, followed by the characters of the string in two-byte
chunks of UTF-16 data, followed by two zero bytes."
Note the "pretty much" part here, however BSTR themselves are also explained in Eric's blog.
Mind you, that all of this should be considered an implementation detail. And even though it shouldn't really concern most of us, it might help though during debugging interop issues or in general understanding.

Like answered by cdhowie, it is not the same to the concept of string in c/c++
If you want to the the above, as a suggestion you can try to mimic the implementation through container such as below
List<char> str = new List<char>("u r awesome");
str[0] = 'i';
str[2] = 'm';
Console.WriteLine(str.ToArray());

Related

IndexOf for char array ignoring casing

I'm developing a pdf file viewer. A pdf file stores it characters in bytes and a pdf file can have several megabytes. Using strings for this scenario is a bad idea, because the storage space of a string cannot be reused for another string. Therefor I store these pdf bytes in a char array. When reading the next big pdf file, I can reuse the char array.
Now I need to support a search functionality, so that the user can find a certain text in this huge file. When I am searching, I usually don't want to have to enter proper upper and lower case letters, I might even not remember the correct casing, meaning the search should succeed regardless of casing. When using
string.IndexOf(String, StringComparison)
one can chose InvariantCultureIgnoreCase to get both upper and lower case matches.
However, converting the megabyte char array into an equally big string is a bad idea.
Unfortunately, IndexOf for an Array is not helpful:
public static int IndexOf<T> (T[] array, T value);
This allows to search for only 1 char in a char array and does also not support IgnoreCase, which obviously wouldn't make sense for other arrays, like an integer array.
So the question is:
Which method can be used from DotNet to search a string in a character array.
Please read this before marking this question as dupplicate
I am aware that there are already similar questions regarding searching. But the ones I have seen all convert the character array in one way or another into a string, which I definitely not want.
Also note that many of those solutions don't support ignoring the casing. The solution should also handle exotic Unicodes correctly.
And last but not least, best would be an existing method from DotNet.

I came to the conclusion that I need to implement my own IndexOf method for character arrays. However, programming that proved rather challenging, so I checked in the DotNet source code how string.IndexOf is doing it.
It's a bit confusing because one method is calling another which calls another, each doing not much. Finally, one arrives at:
public unsafe int IndexOf(ReadOnlySpan<char> source, ReadOnlySpan<char> value,
CompareOptions options = CompareOptions.None)
Lo and behold, that was exactly the functionality I was looking for, because it is very easy to convert a char[] into a ReadOnlySpan<char>. This method belongs to the CompareInfo class. To call it, one has to write something like this:
var index = CultureInfo.InvariantCulture.CompareInfo.IndexOf(bigCharArray,
searchString, CompareOptions.IgnoreCase);

Unsafe string creation from char[]

I'm working on a high performance code in which this construct is part of the performance critical section.
This is what happens in some section:
A string is 'scanned' and metadata is stored efficiently.
Based upon this metadata chunks of the main string are separated into a char[][].
That char[][] should be transferred into a string[].
Now, I know you can just call new string(char[]) but then the result would have to be copied.
To avoid this extra copy step from happening I guess it must be possible to write directly to the string's internal buffer. Even though this would be an unsafe operation (and I know this bring lots of implications like overflow, forward compatibility).
I've seen several ways of achieving this, but none I'm really satisfied with.
Does anyone have true suggestions as to how to achieve this?
Extra information:
The actual process doesn't include converting to char[] necessarily, it's practically a 'multi-substring' operation. Like 3 indexes and their lengths appended.
The StringBuilder has too much overhead for the small number of concats.
EDIT:
Due to some vague aspects of what it is exactly that I'm asking, let me reformulate it.
This is what happens:
Main string is indexed.
Parts of the main string are copied to a char[].
The char[] is converted to a string.
What I'd like to do is merge step 2 and 3, resulting in:
Main string is indexed.
Parts of the main string are copied to a string (and the GC can keep its hands off of it during the process by proper use of the fixed keyword?).
And a note is that I cannot change the output type from string[], since this is an external library, and projects depend on it (backward compatibility).

I think that what you are asking to do is to 'carve up' an existing string in-place into multiple smaller strings without re-allocating character arrays for the smaller strings. This won't work in the managed world.
For one reason why, consider what happens when the garbage collector comes by and collects or moves the original string during a compaction- all of those other strings 'inside' of it are now pointing at some arbitrary other memory, not the original string you carved them out of.
EDIT: In contrast to the character-poking involved in Ben's answer (which is clever but IMHO a bit scary), you can allocate a StringBuilder with a pre-defined capacity, which eliminates the need to re-allocate the internal arrays. See http://msdn.microsoft.com/en-us/library/h1h0a5sy.aspx.

What happens if you do:
string s = GetBuffer();
fixed (char* pch = s) {
pch[0] = 'R';
pch[1] = 'e';
pch[2] = 's';
pch[3] = 'u';
pch[4] = 'l';
pch[5] = 't';
}
I think the world will come to an end (Or at least the .NET managed portion of it), but that's very close to what StringBuilder does.
Do you have profiler data to show that StringBuilder isn't fast enough for your purposes, or is that an assumption?

Just create your own addressing system instead of trying to use unsafe code to map to an internal data structure.
Mapping a string (which is also readable as a char[]) to an array of smaller strings is no different from building a list of address information (index & length of each substring). So make a new List<Tuple<int,int>> instead of a string[] and use that data to return the correct string from your original, unaltered data structure. This could easily be encapsulated into something that exposed string[].

In .NET, there is no way to create an instance of String which shares data with another string. Some discussion on why that is appears in this comment from Eric Lippert.

C# better way to do this?

Hi I have this code below and am looking for a prettier/faster way to do this.
Thanks!
string value = "HelloGoodByeSeeYouLater";
string[] y = new string[]{"Hello", "You"};
foreach(string x in y)
{
value = value.Replace(x, "");
}

You could do:
y.ToList().ForEach(x => value = value.Replace(x, ""));
Although I think your variant is more readable.

Forgive me, but someone's gotta say it,
value = Regex.Replace( value, string.Join("|", y.Select(Regex.Escape)), "" );
Possibly faster, since it creates fewer strings.
EDIT: Credit to Gabe and lasseespeholt for Escape and Select.

While not any prettier, there are other ways to express the same thing.
In LINQ:
value = y.Aggregate(value, (acc, x) => acc.Replace(x, ""));
With String methods:
value = String.Join("", value.Split(y, StringSplitOptions.None));
I don't think anything is going to be faster in managed code than a simple Replace in a foreach though.

It depends on the size of the string you are searching. The foreach example is perfectly fine for small operations but creates a new instance of the string each time it operates because the string is immutable. It also requires searching the whole string over and over again in a linear fashion.
The basic solutions have all been proposed. The Linq examples provided are good if you are comfortable with that syntax; I also liked the suggestion of an extension method, although that is probably the slowest of the proposed solutions. I would avoid a Regex unless you have an extremely specific need.
So let's explore more elaborate solutions and assume you needed to handle a string that was thousands of characters in length and had many possible words to be replaced. If this doesn't apply to the OP's need, maybe it will help someone else.
Method #1 is geared towards large strings with few possible matches.
Method #2 is geared towards short strings with numerous matches.
Method #1
I have handled large-scale parsing in c# using char arrays and pointer math with intelligent seek operations that are optimized for the length and potential frequency of the term being searched for. It follows the methodology of:
Extremely cheap Peeks one character at a time
Only investigate potential matches
Modify output when match is found
For example, you might read through the whole source array and only add words to the output when they are NOT found. This would remove the need to keep redimensioning strings.
A simple example of this technique is looking for a closing HTML tag in a DOM parser. For example, I may read an opening STYLE tag and want to skip through (or buffer) thousands of characters until I find a closing STYLE tag.
This approach provides incredibly high performance, but it's also incredibly complicated if you don't need it (plus you need to be well-versed in memory manipulation/management or you will create all sorts of bugs and instability).
I should note that the .Net string libraries are already incredibly efficient but you can optimize this approach for your own specific needs and achieve better performance (and I have validated this firsthand).
Method #2
Another alternative involves storing search terms in a Dictionary containing Lists of strings. Basically, you decide how long your search prefix needs to be, and read characters from the source string into a buffer until you meet that length. Then, you search your dictionary for all terms that match that string. If a match is found, you explore further by iterating through that List, if not, you know that you can discard the buffer and continue.
Because the Dictionary matches strings based on hash, the search is non-linear and ideal for handling a large number of possible matches.
I'm using this methodology to allow instantaneous (<1ms) searching of every airfield in the US by name, state, city, FAA code, etc. There are 13K airfields in the US, and I've created a map of about 300K permutations (again, a Dictionary with prefixes of varying lengths, each corresponding to a list of matches).
For example, Phoenix, Arizona's main airfield is called Sky Harbor with the short ID of KPHX. I store:
KP
KPH
KPHX
Ph
Pho
Phoe
Ar
Ari
Ariz
Sk
Sky
Ha
Har
Harb
There is a cost in terms of memory usage, but string interning probably reduces this somewhat and the resulting speed justifies the memory usage on data sets of this size. Searching happens as the user types and is so fast that I have actually introduced an artificial delay to smooth out the experience.
Send me a message if you have the need to dig into these methodologies.

Extension method for elegance
(arguably "prettier" at the call level)
I'll implement an extension method that allows you to call your implementation directly on the original string as seen here.
value = value.Remove(y);
// or
value = value.Remove("Hello", "You");
// effectively
string value = "HelloGoodByeSeeYouLater".Remove("Hello", "You");
The extension method is callable on any string value in fact, and therefore easily reusable.
Implementation of Extension method:
I'm going to wrap your own implementation (shown in your question) in an extension method for pretty or elegant points and also employ the params keyword to provide some flexbility passing the arguments. You can substitute somebody else's faster implementation body into this method.
static class EXTENSIONS {
static public string Remove(this string thisString, params string[] arrItems) {
// Whatever implementation you like:
if (thisString == null)
return null;
var temp = thisString;
foreach(string x in arrItems)
temp = temp.Replace(x, "");
return temp;
}
}
That's the brightest idea I can come up with right now that nobody else has touched on.

the internals of System.String

I used reflection to look at the internal fields of System.String and I found three fields:
m_arrayLength
m_stringLength
m_firstChar
I don't understand how this works.
m_arrayLength is the length of some array. Where is this array? It's apparently not a member field of the string class.
m_stringLength makes sense. It's the length of the string.
m_firstChar is the first character in the string.
So my question is where are the rest of the characters for the string? Where are the contents of the string stored if not in the string class?

The first char provides access (via &m_firstChar) to an address in memory of the first character in the buffer. The length tells it how many characters are in the string, making .Length efficient (better than looking for a nul char). Note that strings can be oversized (especially if created with StringBuilder, and a few other scenarios), so sometimes the actual buffer is actually longer than the string. So it is important to track this. StringBuilder, for example, actually mutates a string within its buffer, so it needs to know how much it can add before having to create a larger buffer (see AppendInPlace, for example).

Much of the implementation of System.String is in native code (C/C++) and not in managed code (C#). If you take a look at the decompiled code you'll see that most of the "interesting" or "core" methods are decorated with this attribute:
[MethodImpl(MethodImplOptions.InternalCall)]
Only some of the helper/convenience APIs are implemented in C#.
So where are the characters for the string stored? It's top secret! Deep down inside the CLR's core native code implementation.

I'd be thinking immediately that m_firstChar is not the first character, rather a pointer to the first character. That would make much more sense (although, since I'm not privy to the source, I can't be certain).
It makes little sense to store the first character of a string unless you want a blindingly fast s.substring(0,1) operation :-) There's a good chance the characters themselves (that the three fields allude to) will be allocated separately from the actual object.

Correct answer on difference between string and System.string is here: string vs System.String
There is nothing about native implementations

Is string.Length in C# (.NET) instant variable?

I'm wondering if string.Length in C# is an instant variable. By instant variable I mean, when I create the string:
string A = "";
A = "Som Boh";
Is length being computed now?
OR
Is it computed only after I try to get A.Length?

Firstly, note that strings in .NET are very different to strings stored in unmanaged languages (such as C++)...
In the CLR, the length of the string (in chars and in bytes) is in fact stored in memory so that the CLR knows how large the block of memory (array of chars) containing the string is. This is done upon creation of the string and doesn't get changed given that the System.String type is immutable.
In C++ this is rather different, as the length of a string is discovered by reading up until the first null character.
Because of the way memory usage works in the CLR, you can essentially consider that getting the Length property of a string is just like retrieving an int variable. The performance cost here is going to be absolutely minimal, if that's what you're considering.
If you want to read up more about strings in .NET, try Jon Skeet's article on the topic - it seems to have all the details you might ever want to know about strings in .NET.

The length of the string is not computed, it is known at construction time. Since String is immutable, there will be no need for calculating it later.
A .NET string is stored as a field containing the count of characters, and a corresponding series of unicode characters.

.NET strings are stored with the length pre-computed and stored at the start of the internal structure, so the .Length property simply fetches that value, making it an O(1) function.

It looks like it is a property of string, which is probably set in the constructor. Since it is not a function, I doubt that it is computed when you call it. They are simply getting the value of the Length property.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.