The internals of System.String - C#

I used reflection to look at the internal fields of System.String and I found three fields:
m_arrayLength
m_stringLength
m_firstChar
I don't understand how this works.
m_arrayLength is the length of some array. Where is this array? It's apparently not a member field of the string class.
m_stringLength makes sense. It's the length of the string.
m_firstChar is the first character in the string.
So my question is where are the rest of the characters for the string? Where are the contents of the string stored if not in the string class?

The first char provides access (via &m_firstChar) to the address in memory of the first character in the buffer. The length tells it how many characters are in the string, making .Length efficient (better than scanning for a NUL char). Note that strings can be oversized (especially if created with StringBuilder, and in a few other scenarios), so the underlying buffer is sometimes longer than the string. That is why the buffer size (m_arrayLength) is tracked separately from the string length. StringBuilder, for example, actually mutates a string within its buffer, so it needs to know how much it can add before having to create a larger buffer (see AppendInPlace, for example).
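The difference between buffer size and string length is still visible through StringBuilder's public API. The following is a minimal sketch, assuming a compiler that supports C# 9 top-level statements; the capacity of 64 is an arbitrary illustrative value and says nothing about how any particular runtime lays out the buffer internally.

```csharp
using System;
using System.Text;

// Reserve room for 64 chars up front, then use only 3 of them.
var sb = new StringBuilder(64);
sb.Append("abc");

Console.WriteLine(sb.Length);    // number of chars actually in use
Console.WriteLine(sb.Capacity);  // number of chars the buffer can hold before growing

// ToString() yields a string whose Length is the used count, not the capacity.
string s = sb.ToString();
Console.WriteLine(s.Length);
```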

Much of the implementation of System.String is in native code (C/C++) and not in managed code (C#). If you take a look at the decompiled code you'll see that most of the "interesting" or "core" methods are decorated with this attribute:
[MethodImpl(MethodImplOptions.InternalCall)]
Only some of the helper/convenience APIs are implemented in C#.
So where are the characters for the string stored? It's top secret! Deep down inside the CLR's core native code implementation.

I'd be thinking immediately that m_firstChar is not the first character, rather a pointer to the first character. That would make much more sense (although, since I'm not privy to the source, I can't be certain).
It makes little sense to store the first character of a string unless you want a blindingly fast s.substring(0,1) operation :-) There's a good chance the characters themselves (that the three fields allude to) will be allocated separately from the actual object.
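That guess can be tested by printing the declared type of each field with reflection. This is only a sketch: the field names and their number are implementation details that differ between runtime versions, so no particular output is assumed here. For what it is worth, the .NET Framework 4.x reference source declares m_firstChar as a plain char, with the remaining characters following it inline in the same object, so a pointer type is not expected to show up.

```csharp
using System;
using System.Reflection;

// Enumerate the private instance fields of System.String.
FieldInfo[] fields = typeof(string).GetFields(
    BindingFlags.Instance | BindingFlags.NonPublic);

foreach (FieldInfo field in fields)
{
    // Names are runtime-specific (for example m_firstChar on older frameworks).
    Console.WriteLine($"{field.FieldType.FullName} {field.Name}");
}
```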

The correct answer on the difference between string and System.String is here: string vs System.String
It says nothing about native implementations

Related

IndexOf for char array ignoring casing

I'm developing a PDF file viewer. A PDF file stores its characters in bytes, and a PDF file can be several megabytes in size. Using strings for this scenario is a bad idea, because the storage space of a string cannot be reused for another string. Therefore I store these PDF bytes in a char array. When reading the next big PDF file, I can reuse the char array.
Now I need to support a search functionality, so that the user can find a certain text in this huge file. When I am searching, I usually don't want to have to enter proper upper and lower case letters, I might even not remember the correct casing, meaning the search should succeed regardless of casing. When using
string.IndexOf(String, StringComparison)
one can choose InvariantCultureIgnoreCase to get both upper and lower case matches.
However, converting the megabyte char array into an equally big string is a bad idea.
Unfortunately, IndexOf for an Array is not helpful:
public static int IndexOf<T> (T[] array, T value);
This allows searching for only one char in a char array and also does not support IgnoreCase, which obviously wouldn't make sense for other arrays, like an integer array.
So the question is:
Which method from DotNet can be used to search for a string in a character array?
Please read this before marking this question as a duplicate
I am aware that there are already similar questions regarding searching. But the ones I have seen all convert the character array into a string in one way or another, which I definitely do not want.
Also note that many of those solutions don't support ignoring the casing. The solution should also handle exotic Unicode characters correctly.
And last but not least, best would be an existing method from DotNet.
I came to the conclusion that I need to implement my own IndexOf method for character arrays. However, programming that proved rather challenging, so I checked in the DotNet source code how string.IndexOf does it.
It's a bit confusing because one method calls another, which calls another, each doing very little. Finally, one arrives at:
public unsafe int IndexOf(ReadOnlySpan<char> source, ReadOnlySpan<char> value,
CompareOptions options = CompareOptions.None)
Lo and behold, that was exactly the functionality I was looking for, because it is very easy to convert a char[] into a ReadOnlySpan<char>. This method belongs to the CompareInfo class. To call it, one has to write something like this:
var index = CultureInfo.InvariantCulture.CompareInfo.IndexOf(bigCharArray,
searchString, CompareOptions.IgnoreCase);
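The call above can be fleshed out into a small runnable sketch. It assumes .NET 5 or later, where the span-based CompareInfo.IndexOf overload is public. The buffer size, the sample text, and the usedLength variable are made up for illustration: because the array is reused, only its filled part should be searched, which is why the span is sliced.

```csharp
using System;
using System.Globalization;

// Reusable buffer; only the first usedLength chars hold valid text.
char[] bigCharArray = new char[1024];
string sample = "The Quick Brown Fox";
sample.CopyTo(0, bigCharArray, 0, sample.Length);
int usedLength = sample.Length;   // hypothetical bookkeeping variable

string searchString = "brown";

// Search the filled part of the array without creating a string from it.
int index = CultureInfo.InvariantCulture.CompareInfo.IndexOf(
    bigCharArray.AsSpan(0, usedLength),
    searchString.AsSpan(),
    CompareOptions.IgnoreCase);

Console.WriteLine(index); // 10
```

Slicing with AsSpan(0, usedLength) does not copy the data, so the search cost does not include an extra allocation for the megabyte-sized buffer.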

C# - How do I know the correct length to read a string when reading it from a memory address?

So I have a problem: I'm reading a string from a memory address that is different at different times. For example:
Axe?ca Ocarina?tar??ing?ing????????????
I only need Axe.
Ball of Green Yarn??ing?ing????????????
I only need Ball of Green Yarn.
I'm reading 80 bytes of text (40 chars) because that's the maximum number of characters the string should reach. But how can I know how long the string actually is?
It really depends on what's writing the string.
Generally, strings are NUL-terminated, i.e. a '\0' character immediately follows the string. Old-style (non-_s-variant) C functions like strlen and strcat use that to determine the end of existing strings and mark the end of modified strings.
Most string data types tend to work this way, but not all. In Turbo Pascal, strings were length-prefixed. BSTRs used in COM (including pre-.NET VB) are both.
Based on the samples you've shown, there's a good chance that the ? character you're seeing after the part you want is a NUL character. It looks like the buffer is being reused and re-terminated each time, e.g. a shorter string like "Axe" was written over a longer string like a certain kind of ocarina.
Examine the buffer in the debugger and you'll probably find a '\0' character immediately following what you want.
Probably. Again, it depends on what's writing the string. Until you look for yourself, it could be anything, and even then, it could just be a coincidence that it's NUL-terminated this time. Don't rely on observation alone. Without documentation, it could be different and still just as valid. Whatever you do, do not read past the 40-character buffer you know you have, NUL terminated or not.
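If the buffer does turn out to be NUL-terminated, trimming it could look like the sketch below. It assumes the text is UTF-16 (80 bytes for 40 chars, as in the question) and simulates the reused buffer with made-up data; ReadNulTerminated is a hypothetical helper name, not an existing API. The maxBytes argument keeps the read inside the known buffer size.

```csharp
using System;
using System.Text;

// Hypothetical helper: decode at most maxBytes as UTF-16 and cut at the first NUL.
static string ReadNulTerminated(byte[] buffer, int maxBytes)
{
    int count = Math.Min(buffer.Length, maxBytes);
    count -= count % 2;  // decode whole UTF-16 code units only
    string decoded = Encoding.Unicode.GetString(buffer, 0, count);
    int end = decoded.IndexOf('\0');
    // No terminator found: fall back to the whole (bounded) buffer.
    return end >= 0 ? decoded.Substring(0, end) : decoded;
}

// Simulated 80-byte buffer in which "Axe" was written over a longer, older string.
byte[] raw = new byte[80];
Encoding.Unicode.GetBytes("Axe\0ca Ocarina\0tar").CopyTo(raw, 0);

string item = ReadNulTerminated(raw, 80);
Console.WriteLine(item); // Axe
```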

Unsafe string creation from char[]

I'm working on high-performance code in which this construct is part of the performance-critical section.
This is what happens in some section:
A string is 'scanned' and metadata is stored efficiently.
Based upon this metadata chunks of the main string are separated into a char[][].
That char[][] should be transferred into a string[].
Now, I know you can just call new string(char[]) but then the result would have to be copied.
To avoid this extra copy step from happening I guess it must be possible to write directly to the string's internal buffer. Even though this would be an unsafe operation (and I know this brings lots of implications, like overflow and forward compatibility).
I've seen several ways of achieving this, but none I'm really satisfied with.
Does anyone have true suggestions as to how to achieve this?
Extra information:
The actual process doesn't include converting to char[] necessarily, it's practically a 'multi-substring' operation. Like 3 indexes and their lengths appended.
The StringBuilder has too much overhead for the small number of concats.
EDIT:
Due to some vague aspects of what it is exactly that I'm asking, let me reformulate it.
This is what happens:
Main string is indexed.
Parts of the main string are copied to a char[].
The char[] is converted to a string.
What I'd like to do is merge step 2 and 3, resulting in:
Main string is indexed.
Parts of the main string are copied to a string (and the GC can keep its hands off of it during the process by proper use of the fixed keyword?).
And a note is that I cannot change the output type from string[], since this is an external library, and projects depend on it (backward compatibility).
I think that what you are asking to do is to 'carve up' an existing string in-place into multiple smaller strings without re-allocating character arrays for the smaller strings. This won't work in the managed world.
For one reason why, consider what happens when the garbage collector comes by and collects or moves the original string during a compaction- all of those other strings 'inside' of it are now pointing at some arbitrary other memory, not the original string you carved them out of.
EDIT: In contrast to the character-poking involved in Ben's answer (which is clever but IMHO a bit scary), you can allocate a StringBuilder with a pre-defined capacity, which eliminates the need to re-allocate the internal arrays. See http://msdn.microsoft.com/en-us/library/h1h0a5sy.aspx.
What happens if you do:
string s = GetBuffer();
fixed (char* pch = s) {
    pch[0] = 'R';
    pch[1] = 'e';
    pch[2] = 's';
    pch[3] = 'u';
    pch[4] = 'l';
    pch[5] = 't';
}
I think the world will come to an end (Or at least the .NET managed portion of it), but that's very close to what StringBuilder does.
Do you have profiler data to show that StringBuilder isn't fast enough for your purposes, or is that an assumption?
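On newer runtimes there is a supported way to fill a string's buffer while it is being created, without unsafe code: string.Create. The sketch below assumes .NET Core 2.1 or later (and C# 9 for the static lambda and top-level statements); the sample text and the (start, length) pairs are invented for illustration. It copies several slices of a source string straight into the new string, so no intermediate char[] is needed.

```csharp
using System;

string source = "alpha,beta,gamma";

// Hypothetical metadata: (start, length) of each chunk to append.
(int Start, int Length)[] parts = { (0, 5), (6, 4), (11, 5) };

int total = 0;
foreach (var part in parts) total += part.Length;

// The callback receives the new string's buffer as a Span<char> and fills it once.
string result = string.Create(total, (source, parts), static (dest, state) =>
{
    var (src, ranges) = state;
    int pos = 0;
    foreach (var (start, length) in ranges)
    {
        src.AsSpan(start, length).CopyTo(dest.Slice(pos));
        pos += length;
    }
});

Console.WriteLine(result); // alphabetagamma
```

The string is only observable after the callback returns, so immutability is preserved from the caller's point of view.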
Just create your own addressing system instead of trying to use unsafe code to map to an internal data structure.
Mapping a string (which is also readable as a char[]) to an array of smaller strings is no different from building a list of address information (index & length of each substring). So make a new List<Tuple<int,int>> instead of a string[] and use that data to return the correct string from your original, unaltered data structure. This could easily be encapsulated into something that exposed string[].
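A minimal sketch of that addressing idea follows. It uses value tuples instead of Tuple<int,int> purely for readability, and the sample string and offsets are invented. Substrings are only created at the boundary where a string[] has to be handed to the external library.

```csharp
using System;
using System.Collections.Generic;

string original = "alpha,beta,gamma";

// Address information only: no characters are copied while indexing.
var segments = new List<(int Index, int Length)>
{
    (0, 5),
    (6, 4),
    (11, 5),
};

// Materialize real strings only when the string[] is actually required.
string[] output = new string[segments.Count];
for (int i = 0; i < segments.Count; i++)
{
    output[i] = original.Substring(segments[i].Index, segments[i].Length);
}

Console.WriteLine(string.Join("|", output)); // alpha|beta|gamma
```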
In .NET, there is no way to create an instance of String which shares data with another string. Some discussion on why that is appears in this comment from Eric Lippert.

System.String underlying implementation

I was recently trying to do the following in C#:
string str = "u r awesome";
str[0]="i";
And it wouldn't work because apparently str[i] is only a get not a set, so I was wondering what the underlying implementation of string is that would force str[i] to only be a get.
Isn't it just a managed wrapper for a char *? So then why can't I set str[i]?
You can't set characters of a string because the .NET String class is immutable -- that means that its contents cannot be changed after it is created. This allows the same string instance to be used many times safely, without one object worrying that another object is going to stomp on its strings.
If you need a mutable class that lets you manipulate a string of characters, consider using StringBuilder instead.
If you want to compare to C, the String type is like const char * except that you cannot just cast away the constness. StringBuilder is more like a char * (with automatic allocation resizing) and with a method (ToString()) to create a new, independent String instance from its contents.
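As a short sketch of the StringBuilder route, reusing the string from the question: the indexer on StringBuilder has a setter, so individual characters can be replaced before an independent string is produced with ToString().

```csharp
using System;
using System.Text;

var sb = new StringBuilder("u r awesome");

// Unlike string, StringBuilder's indexer can be assigned to.
sb[0] = 'i';
sb[2] = 'm';

// ToString() creates a new, immutable string from the current contents.
string result = sb.ToString();
Console.WriteLine(result); // i m awesome
```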
The answers the others gave concerning immutability are of course correct and are the "actual" cause of the issue you're having.
Since you specifically asked about the underlying implementation (even if just out of curiosity), and as a reference for others who might stumble upon this question, here is some more information about that topic from Eric Lippert:
"In the .NET CLR, strings are laid out in memory pretty much the same
way that BSTRs were implemented in OLE Automation: as a word-aligned
memory buffer consisting of a four-byte integer giving the length of
the string, followed by the characters of the string in two-byte
chunks of UTF-16 data, followed by two zero bytes."
Note the "pretty much" part here; BSTRs themselves are also explained on Eric's blog.
Mind you, all of this should be considered an implementation detail. Even though it shouldn't really concern most of us, it might help when debugging interop issues or for general understanding.
As cdhowie answered, this is not the same as the concept of a string in C/C++.
If you want to do the above, one suggestion is to mimic it with a container, as below:
List<char> str = new List<char>("u r awesome");
str[0] = 'i';
str[2] = 'm';
Console.WriteLine(str.ToArray());

Is string.Length in C# (.NET) instant variable?

I'm wondering if string.Length in C# is an instant variable. By instant variable I mean, when I create the string:
string A = "";
A = "Som Boh";
Is length being computed now?
OR
Is it computed only after I try to get A.Length?
Firstly, note that strings in .NET are very different to strings stored in unmanaged languages (such as C++)...
In the CLR, the length of the string (in chars and in bytes) is in fact stored in memory so that the CLR knows how large the block of memory (array of chars) containing the string is. This is done upon creation of the string and doesn't get changed given that the System.String type is immutable.
In C (and for C-style strings in C++) this is rather different, as the length of a string is discovered by reading up until the first null character.
Because of the way memory usage works in the CLR, you can essentially consider that getting the Length property of a string is just like retrieving an int variable. The performance cost here is going to be absolutely minimal, if that's what you're considering.
If you want to read up more about strings in .NET, try Jon Skeet's article on the topic - it seems to have all the details you might ever want to know about strings in .NET.
The length of the string is not computed, it is known at construction time. Since String is immutable, there will be no need for calculating it later.
A .NET string is stored as a field containing the count of characters, followed by the corresponding series of Unicode characters (UTF-16 code units).
.NET strings are stored with the length pre-computed and stored at the start of the internal structure, so the .Length property simply fetches that value, making it an O(1) function.
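One observable consequence of a stored length, shown as a small sketch: Length does not stop at an embedded NUL character, and it counts UTF-16 code units rather than user-perceived characters. The sample strings are arbitrary.

```csharp
using System;

// An embedded NUL is just another char; Length is not found by scanning for it.
string withNul = "a\0b";
Console.WriteLine(withNul.Length); // 3

// A character outside the Basic Multilingual Plane takes two UTF-16 code units.
string emoji = "\U0001F600";
Console.WriteLine(emoji.Length); // 2
```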
It looks like it is a property of string, which is probably set in the constructor. Since it is not a function, I doubt that it is computed when you call it. They are simply getting the value of the Length property.
