Is string.Length in C# (.NET) an instant variable?

I'm wondering if string.Length in C# is an instant variable. By instant variable I mean, when I create the string:
string A = "";
A = "Som Boh";
Is length being computed now?
OR
Is it computed only after I try to get A.Length?

Firstly, note that strings in .NET are very different to strings stored in unmanaged languages (such as C++)...
In the CLR, the length of the string (in chars and in bytes) is in fact stored in memory so that the CLR knows how large the block of memory (array of chars) containing the string is. This is done upon creation of the string and doesn't get changed given that the System.String type is immutable.
In C (and with C-style char* strings in C++) this is rather different, as the length of a string is discovered by reading up to the first null character.
Because of the way memory usage works in the CLR, you can essentially consider that getting the Length property of a string is just like retrieving an int variable. The performance cost here is going to be absolutely minimal, if that's what you're considering.
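In other words (reusing the question's own variable):
string A = "Som Boh";

// The character count is recorded when the string object is created, so reading
// Length is a constant-time field read, not a scan for a terminating character.
int length = A.Length;   // 7

// Reading it repeatedly (e.g. in a loop condition) does not re-count the characters.
for (int i = 0; i < A.Length; i++)
{
    // ...
}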
If you want to read up more about strings in .NET, try Jon Skeet's article on the topic - it seems to have all the details you might ever want to know about strings in .NET.

The length of the string is not computed, it is known at construction time. Since String is immutable, there will be no need for calculating it later.
A .NET string is stored as a field containing the count of characters, followed by the corresponding sequence of Unicode characters.

.NET strings are stored with the length pre-computed and stored at the start of the internal structure, so the .Length property simply fetches that value, making it an O(1) function.

It looks like Length is a property of string that is set when the string is constructed. Since it is not a method that does any work per call, I doubt it is computed when you access it; reading A.Length simply returns the stored value of the Length property.

Related

How are String and Char types stored in memory in .NET?

I'd need to store a language code string, such as "en", which will always contains 2 characters.
Is it better to define the type as "String" or "Char"?
private string languageCode;
vs
private char[] languageCode;
Or is there another, better option?
How are these two stored in memory? How many bytes or bits will be allocated to them when values are assigned?
How They Are Stored
Both the string and the char[] are stored on the heap - so storage is the same. Internally I would assume a string simply is a cover for char[] with lots of extra code to make it useful for you.
Also if you have lots of repeating strings, you can make use of Interning to reduce the memory footprint of those strings.
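A small sketch of interning (string literals are interned automatically; string.Intern is only needed for strings built at runtime):
string code = new string(new[] { 'e', 'n' });        // a fresh heap instance
string pooled = string.Intern(code);                 // returns the single pooled "en" instance
bool same = object.ReferenceEquals(pooled, "en");    // true - both refer to the interned copy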
The Better Option
I would favour string - it is immediately more apparent what the data type is and how you intend to use it. People are also more accustomed to using strings, so maintainability won't suffer. You will also benefit greatly from all the boilerplate code that has been written for you, and Microsoft has put a lot of effort into making sure the string type is not a performance hog.
The Allocation Size
I don't know exactly how much is allocated, but I believe strings are quite efficient in that they only allocate enough to store their Unicode characters; since they are immutable, it is safe to do this. Arrays also cannot be resized without allocating the space in a new array, so I'd again assume they grab only what they need.
See also: Overhead of a .NET array?
Alternatives
Based on your information that there are only 20 language codes and performance is key, you could declare your own enum in order to reduce the size required to represent the codes:
enum LanguageCode : byte
{
    en = 0,
}
This will only take 1 byte as opposed to 4+ for two char (in an array), but it does limit the range of available LanguageCode values to the range of byte - which is more than big enough for 20 items.
You can see the size of value types using the sizeof() operator: sizeof(LanguageCode). Enums are nothing but their underlying type under the hood; they default to int, but as you can see in my code sample you can change that by "inheriting" a new type.
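For illustration, a quick check of those sizes (a sketch; the output assumes the LanguageCode enum declared above):
Console.WriteLine(sizeof(LanguageCode));   // 1, because the underlying type is byte
Console.WriteLine(sizeof(char));           // 2, so two chars cost 4 bytes before any array overhead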
Short answer: Use string
Long answer:
private string languageCode;
AFAIK strings are stored as a length-prefixed array of chars. A String object is instantiated on the heap to maintain this raw array, but a String object is much more than a simple array: it enables basic string operations like comparison, concatenation, substring extraction, search, etc.
While
private char[] languageCode;
will be stored as an array of chars, i.e. an Array object will be created on the heap and then used to manage your characters. But it still has a length attribute which is stored internally, so there are no apparent savings in memory compared to a string. Presumably an Array is simpler than a String and may have fewer internal variables, thus offering a lower memory footprint (this needs to be verified).
But OTOH you lose the ability to perform string operations on this char array; even operations like string comparison become cumbersome. So, long story short, use a string!
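To illustrate that last point (languageCodeChars stands in for the char[] variant so both can be shown side by side):
// With a string, comparison is a single call:
bool isEnglish = languageCode == "en";

// With a char[], you compare element by element (or build a string first):
bool isEnglishArray = languageCodeChars.Length == 2
                      && languageCodeChars[0] == 'e'
                      && languageCodeChars[1] == 'n';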
How are these two stored in memory? How many bytes or bits will be allocated to them when values are assigned?
Every instance in .NET is stored as follows: one IntPtr-sized field for the type identifier; one more for locking on the instance; the remainder is instance field data rounded up to an IntPtr-sized amount. Hence, on a 32-bit platform every instance occupies 8 bytes + field data.
This applies to both a string and a char[]. Both of these also store the length of the data as an IntPtr-sized integer, followed by the actual data. Thus, a two-character string and a two-character char[], on a 32-bit platform, will occupy 8+4+4 = 16 bytes.
The only way to reduce this when storing exactly two characters is to store the actual characters, or a struct containing the characters, in a field or an array. All of these would consume only 4 bytes for the characters:
// Option 1
class MyClass
{
    char Char1, Char2;
}
// Option 2
class MyClass
{
    CharStruct chars;
}
...
struct CharStruct { public char Char1; public char Char2; }
MyClass will end up using 8 bytes (on a 32-bit machine) per instance plus the 4 bytes for the chars.
// Option 3
class MyClass
{
    CharStruct[] chars;
}
This will use 8 bytes for the MyClass overhead, plus 4 bytes for the chars reference, plus 12 bytes for the array's overhead, plus 4 bytes per CharStruct in the array.
If you want to store exactly 2 chars, and do it most efficiently, use a struct:
struct Char2
{
    public char C1, C2;
}
Using this struct will generally not cause new heap allocations. It will just upsize an existing object (by the minimum possible amount) or consume stack space which is very cheap.
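A quick sketch of using it (sizes as discussed above for a 32-bit process):
var code = new Char2 { C1 = 'e', C2 = 'n' };   // no separate heap allocation for the two chars

Char2[] codes = new Char2[20];                  // one array; 4 bytes of character data per element
codes[0] = code;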
Strings indeed have a size overhead of one pointer length, i.e. 4 bytes for a 32 bit process, 8 bytes for a 64 bit process. But then again, strings offer so much more in return than char arrays.
If your application uses many short strings and you don't need their string properties and methods that often, you could probably save a few bytes of memory. But if you want to use any of them as a string, you will first have to create a new string instance, and I can't see how this will save you enough memory to be worth the trouble.
String implements an indexer of type char internally, and you could say a string is essentially a char[] with lots of extra code to make it useful for you; hence, like an array, it is always stored on the heap.
An array cannot be resized without allocating new space; the same holds for a string, which is why it is immutable.
String implements IEnumerable<char>.
Noticeable point: when you pass a string to a method, the reference is passed by value unless you use ref.
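A small sketch of that last point (the reference itself is copied, so reassigning the parameter does not affect the caller unless ref is used):
static void Reassign(string s) { s = "changed"; }
static void ReassignByRef(ref string s) { s = "changed"; }

string a = "original";
Reassign(a);            // a is still "original"; only the copied reference was reassigned
ReassignByRef(ref a);   // a is now "changed"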

Unsafe string creation from char[]

I'm working on high-performance code in which this construct is part of the performance-critical section.
This is what happens in some section:
A string is 'scanned' and metadata is stored efficiently.
Based upon this metadata chunks of the main string are separated into a char[][].
That char[][] should be transferred into a string[].
Now, I know you can just call new string(char[]) but then the result would have to be copied.
To avoid this extra copy step I guess it must be possible to write directly to the string's internal buffer, even though this would be an unsafe operation (and I know this brings lots of implications, like overflow and forward compatibility).
I've seen several ways of achieving this, but none I'm really satisfied with.
Does anyone have true suggestions as to how to achieve this?
Extra information:
The actual process doesn't necessarily involve converting to char[]; it's practically a 'multi-substring' operation, like 3 indexes and their lengths appended.
The StringBuilder has too much overhead for the small number of concats.
EDIT:
Due to some vague aspects of what it is exactly that I'm asking, let me reformulate it.
This is what happens:
Main string is indexed.
Parts of the main string are copied to a char[].
The char[] is converted to a string.
What I'd like to do is merge step 2 and 3, resulting in:
Main string is indexed.
Parts of the main string are copied to a string (and the GC can keep its hands off of it during the process by proper use of the fixed keyword?).
One note: I cannot change the output type from string[], since this is an external library and projects depend on it (backward compatibility).
I think that what you are asking to do is to 'carve up' an existing string in-place into multiple smaller strings without re-allocating character arrays for the smaller strings. This won't work in the managed world.
For one reason why, consider what happens when the garbage collector comes by and collects or moves the original string during a compaction- all of those other strings 'inside' of it are now pointing at some arbitrary other memory, not the original string you carved them out of.
EDIT: In contrast to the character-poking involved in Ben's answer (which is clever but IMHO a bit scary), you can allocate a StringBuilder with a pre-defined capacity, which eliminates the need to re-allocate the internal arrays. See http://msdn.microsoft.com/en-us/library/h1h0a5sy.aspx.
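A hedged sketch of that suggestion (totalLength, mainString and the start/length pairs are placeholder names standing in for the question's metadata scan):
var sb = new StringBuilder(totalLength);     // internal buffer allocated once, up front
sb.Append(mainString, start1, length1);      // Append(string, int, int) copies a slice without an intermediate string
sb.Append(mainString, start2, length2);
string result = sb.ToString();               // one final copy into the resulting string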
What happens if you do:
string s = GetBuffer();
fixed (char* pch = s) {
    pch[0] = 'R';
    pch[1] = 'e';
    pch[2] = 's';
    pch[3] = 'u';
    pch[4] = 'l';
    pch[5] = 't';
}
I think the world will come to an end (Or at least the .NET managed portion of it), but that's very close to what StringBuilder does.
Do you have profiler data to show that StringBuilder isn't fast enough for your purposes, or is that an assumption?
Just create your own addressing system instead of trying to use unsafe code to map to an internal data structure.
Mapping a string (which is also readable as a char[]) to an array of smaller strings is no different from building a list of address information (index & length of each substring). So make a new List<Tuple<int,int>> instead of a string[] and use that data to return the correct string from your original, unaltered data structure. This could easily be encapsulated into something that exposed string[].
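A rough sketch of that encapsulation (SubstringIndex is a made-up name for illustration):
class SubstringIndex
{
    private readonly string source;
    private readonly List<Tuple<int, int>> slices = new List<Tuple<int, int>>();

    public SubstringIndex(string source) { this.source = source; }

    // Record a (start, length) pair instead of copying the substring now.
    public void Add(int start, int length) { slices.Add(Tuple.Create(start, length)); }

    // Materialise a real string only when a caller actually asks for one.
    public string this[int i]
    {
        get { return source.Substring(slices[i].Item1, slices[i].Item2); }
    }
}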
In .NET, there is no way to create an instance of String which shares data with another string. Some discussion on why that is appears in this comment from Eric Lippert.

System.String underlying implementation

I was recently trying to do the following in C#:
string str = "u r awesome";
str[0]="i";
And it wouldn't work, because apparently str[i] only has a getter and no setter, so I was wondering what the underlying implementation of string is that forces str[i] to be read-only.
Isn't it just a managed wrapper for a char *? So then why can't I set str[i]?
You can't set characters of a string because the .NET String class is immutable -- that means that its contents cannot be changed after it is created. This allows the same string instance to be used many times safely, without one object worrying that another object is going to stomp on its strings.
If you need a mutable class that lets you manipulate a string of characters, consider using StringBuilder instead.
If you want to compare to C, the String type is like const char * except that you cannot just cast away the constness. StringBuilder is more like a char * (with automatic allocation resizing) and with a method (ToString()) to create a new, independent String instance from its contents.
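For example, a minimal sketch of getting the effect the question wants via StringBuilder:
var sb = new StringBuilder("u r awesome");
sb[0] = 'i';                    // StringBuilder's indexer has a setter
string str = sb.ToString();     // "i r awesome" -- a new, independent string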
The answers the others gave concerning immutability are of course correct and are the "actual" cause of the issue you're having.
Since you specifically asked about the underlying implementation (and if just out of curiosity), and as a reference to others that might stumble upon this question, here is some more information about that topic from Eric Lippert:
"In the .NET CLR, strings are laid out in memory pretty much the same
way that BSTRs were implemented in OLE Automation: as a word-aligned
memory buffer consisting of a four-byte integer giving the length of
the string, followed by the characters of the string in two-byte
chunks of UTF-16 data, followed by two zero bytes."
Note the "pretty much" part here, however BSTR themselves are also explained in Eric's blog.
Mind you, that all of this should be considered an implementation detail. And even though it shouldn't really concern most of us, it might help though during debugging interop issues or in general understanding.
As cdhowie answered, this is not the same as the concept of a string in C/C++.
If you want to do the above, as a suggestion you can mimic the behaviour through a container such as the one below:
List<char> str = new List<char>("u r awesome");   // string implements IEnumerable<char>
str[0] = 'i';
str[2] = 'm';
Console.WriteLine(str.ToArray());                 // Console.WriteLine has a char[] overload: prints "i m awesome"

Are C# Strings (and other .NET APIs) limited to 2GB in size?

Today I noticed that C#'s String class returns the length of a string as an int. Since an int is always 32 bits, no matter what the architecture, does this mean that a string can only be 2GB or less in length?
A 2GB string would be very unusual, and present many problems along with it. However, most .NET APIs seem to use 'int' to convey values such as length and count. Does this mean we are forever limited to collection sizes which fit in 32 bits?
Seems like a fundamental problem with the .NET API's. I would have expected things like count and length to be returned via the equivalent of 'size_t'.
Seems like a fundamental problem with the .NET API...
I don't know if I'd go that far.
Consider almost any collection class in .NET. Chances are it has a Count property that returns an int. So this suggests the class is bounded at a size of int.MaxValue (2147483647). That's not really a problem; it's a limitation -- and a perfectly reasonable one, in the vast majority of scenarios.
Anyway, what would the alternative be? There's uint -- but that's not CLS-compliant. Then there's long...
What if Length returned a long?
An additional 32 bits of memory would be required anywhere you wanted to know the length of a string.
The benefit would be: we could have strings taking up billions of gigabytes of RAM. Hooray.
Try to imagine the mind-boggling cost of some code like this:
// Lord knows how many characters
string ulysses = GetUlyssesText();
// allocate an entirely new string of roughly equivalent size
string schmulysses = ulysses.Replace("Ulysses", "Schmulysses");
Basically, if you're thinking of string as a data structure meant to store an unlimited quantity of text, you've got unrealistic expectations. When it comes to objects of this size, it becomes questionable whether you have any need to hold them in memory at all (as opposed to hard disk).
Correct, the maximum length would be the size of Int32, however you'll likely run into other memory issues if you're dealing with strings larger than that anyway.
At some value of String.Length, probably around 5MB, it's not really practical to use String anymore. String is optimised for short bits of text.
Think about what happens when you do
msString += " more chars"
Something like:
System calculates length of myString plus length of " more chars"
System allocates that amount of memory
System copies myString to new memory location
System copies " more chars" to new memory location after last copied myString char
The original myString is left to the mercy of the garbage collector.
While this is nice and neat for small bits of text, it's a nightmare for large strings; just finding 2GB of contiguous memory is probably a showstopper.
So if you know you are handling more than a very few MB of characters, use one of the *Buffer classes.
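A rough sketch of the difference (the loop count is arbitrary):
// Repeated += re-allocates and copies the whole string on every iteration.
string s = "";
for (int i = 0; i < 100000; i++)
    s += " more chars";

// StringBuilder appends into a growable internal buffer instead,
// so the copying cost stays roughly linear.
var sb = new StringBuilder();
for (int i = 0; i < 100000; i++)
    sb.Append(" more chars");
string result = sb.ToString();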
It's pretty unlikely that you'll need to store more than two billion objects in a single collection. You're going to incur some pretty serious performance penalties when doing enumerations and lookups, which are the two primary purposes of collections. If you're dealing with a data set that large, there is almost assuredly some other route you can take, such as splitting up your single collection into many smaller collections that contain portions of the entire set of data you're working with.
Heeeey, wait a sec.... we already have this concept -- it's called a dictionary!
If you need to store, say, 5 billion English strings, use this type:
Dictionary<string, List<string>> bigStringContainer;
Let's make the key string represent, say, the first two characters of the string. Then write an extension method like this:
public static string BigStringIndex(this string s)
{
    return String.Concat(s[0], s[1]);
}
and then add items to bigStringContainer like this:
bigStringContainer[item.BigStringIndex()].Add(item);
and call it a day. (There are obviously more efficient ways you could do that, but this is just an example)
Oh, and if you really, really, really do need to be able to look up any arbitrary object by absolute index, use an Array instead of a collection. Okay, yeah, you lose some type safety, but you can index array elements with a long.
The fact that the framework uses Int32 for Count/Length properties, indexers etc is a bit of a red herring. The real problem is that the CLR currently has a max object size restriction of 2GB.
So a string -- or any other single object -- can never be larger than 2GB.
Changing the Length property of the string type to return long, ulong or even BigInteger would be pointless since you could never have more than approx 2^30 characters anyway (2GB max size and 2 bytes per character.)
Similarly, because of the 2GB limit, the only arrays that could even approach having 2^31 elements would be bool[] or byte[] arrays that only use 1 byte per element.
Of course, there's nothing to stop you creating your own composite types to workaround the 2GB restriction.
(Note that the above observations apply to Microsoft's current implementation, and could very well change in future releases. I'm not sure whether Mono has similar limits.)
In versions of .NET prior to 4.5, the maximum object size is 2GB. From 4.5 onwards you can allocate larger objects if gcAllowVeryLargeObjects is enabled. Note that the limit for string is not affected, but "arrays" should cover "lists" too, since lists are backed by arrays.
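For reference, a sketch of that opt-in in an app.config (.NET Framework 4.5+, 64-bit processes only; single strings are still capped):
<configuration>
  <runtime>
    <gcAllowVeryLargeObjects enabled="true" />
  </runtime>
</configuration>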
Even in x64 versions of Windows I got hit by .Net limiting each object to 2GB.
2GB is pretty small for a medical image. 2GB is even small for a Visual Studio download image.
If you are working with a file that is 2GB, that means you're likely going to be using a lot of RAM, and you're seeing very slow performance.
Instead, for very large files, consider using a MemoryMappedFile (see: http://msdn.microsoft.com/en-us/library/system.io.memorymappedfiles.memorymappedfile.aspx). Using this method, you can work with a file of nearly unlimited size, without having to load the whole thing in memory.
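A minimal sketch, assuming a large file at a placeholder path (types from System.IO and System.IO.MemoryMappedFiles):
using (var mmf = MemoryMappedFile.CreateFromFile(@"C:\data\huge.bin", FileMode.Open))
using (var accessor = mmf.CreateViewAccessor(0, 0))   // offset 0, size 0 = map the whole file
{
    byte b = accessor.ReadByte(123456789);            // random access; pages are loaded on demand
}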

the internals of System.String

I used reflection to look at the internal fields of System.String and I found three fields:
m_arrayLength
m_stringLength
m_firstChar
I don't understand how this works.
m_arrayLength is the length of some array. Where is this array? It's apparently not a member field of the string class.
m_stringLength makes sense. It's the length of the string.
m_firstChar is the first character in the string.
So my question is where are the rest of the characters for the string? Where are the contents of the string stored if not in the string class?
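For reference, this is roughly the reflection used (the field names are implementation details of the .NET Framework-era string class and can differ between runtime versions; FieldInfo and BindingFlags come from System.Reflection):
foreach (FieldInfo field in typeof(string).GetFields(BindingFlags.NonPublic | BindingFlags.Instance))
{
    Console.WriteLine(field.Name);   // e.g. m_arrayLength, m_stringLength, m_firstChar
}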
The first char provides access (via &m_firstChar) to an address in memory of the first character in the buffer. The length tells it how many characters are in the string, making .Length efficient (better than looking for a nul char). Note that strings can be oversized (especially if created with StringBuilder, and a few other scenarios), so sometimes the actual buffer is actually longer than the string. So it is important to track this. StringBuilder, for example, actually mutates a string within its buffer, so it needs to know how much it can add before having to create a larger buffer (see AppendInPlace, for example).
Much of the implementation of System.String is in native code (C/C++) and not in managed code (C#). If you take a look at the decompiled code you'll see that most of the "interesting" or "core" methods are decorated with this attribute:
[MethodImpl(MethodImplOptions.InternalCall)]
Only some of the helper/convenience APIs are implemented in C#.
So where are the characters for the string stored? It's top secret! Deep down inside the CLR's core native code implementation.
I'd be thinking immediately that m_firstChar is not the first character, rather a pointer to the first character. That would make much more sense (although, since I'm not privy to the source, I can't be certain).
It makes little sense to store the first character of a string unless you want a blindingly fast s.substring(0,1) operation :-) There's a good chance the characters themselves (that the three fields allude to) will be allocated separately from the actual object.
A correct answer on the difference between string and System.String is here: string vs System.String. There is nothing there about native implementations, though.
