High-bit flags in the string length - C#

There is an old article talking about some string internals in .NET/C#. One of the interesting tidbits:
m_stringLength
This is the logical length of the string, the one returned by String.Length.
Because a number of high bits are used for additional flags to enhance performance, the maximum length of the string is constrained to a limit much smaller than UInt32.MaxValue on 32-bit systems. Some of these flags indicate that the string contains simple characters, such as plain ASCII, and will not require invoking complex Unicode algorithms for sorting and comparison tests.
I know that BinaryReader reads strings as length-prefixed with a 7-bit-encoded integer; does that mean the extra space is used for the aforementioned string flag (0 - ASCII, 1 - wide)?
Is this relevant for Mono from version 2.0 onwards? I'm writing a simple custom wrapper around a string to make it mutable, and although that string is not going to be used in sorting or comparisons (for now), I was wondering whether I should pre-emptively fill a newly allocated string with an ASCII or Unicode character (i.e. if I know/assume the content) so the flag is set by default.
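For context, here is a minimal sketch of the 7-bit-encoded length prefix that BinaryWriter/BinaryReader use (my own illustration of the format, not the BCL source; the stream parameter is an assumption):

static void Write7BitEncodedInt(Stream stream, int value)
{
    // Emit 7 bits per byte, low-order first; the high bit of each
    // byte signals that another byte follows.
    uint v = (uint)value;
    while (v >= 0x80)
    {
        stream.WriteByte((byte)(v | 0x80)); // more bytes follow
        v >>= 7;
    }
    stream.WriteByte((byte)v);              // final byte, high bit clear
}

Note that this prefix encodes only the byte length of the serialized string data.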

Related

How are String and Char types stored in memory in .NET?

I'd need to store a language code string, such as "en", which will always contain 2 characters.
Is it better to define the type as "String" or "Char"?
private string languageCode;
vs
private char[] languageCode;
Or is there another, better option?
How are these two stored in memory? How many bytes or bits will be allocated to them when values are assigned?
How They Are Stored
Both the string and the char[] are stored on the heap - so storage is the same. Internally I would assume a string is simply a cover for char[] with lots of extra code to make it useful for you.
Also if you have lots of repeating strings, you can make use of Interning to reduce the memory footprint of those strings.
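For illustration, a minimal sketch of interning in action:

string a = new string(new[] { 'e', 'n' });   // a fresh heap instance
string b = "en";                              // literals are interned by default
Console.WriteLine(ReferenceEquals(a, b));     // False: two distinct objects
string c = string.Intern(a);                  // returns the pooled instance
Console.WriteLine(ReferenceEquals(b, c));     // True: both point at the intern pool entry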
The Better Option
I would favour string - it is immediately more apparent what the data type is and how you intend to use it. People are also more accustomed to using strings, so maintainability won't suffer. You will also benefit greatly from all the boilerplate code that has been written for you. Microsoft has also put a lot of effort into making sure the string type is not a performance hog.
The Allocation Size
I don't know exactly how much is allocated, but strings are quite efficient in that they allocate only enough to store their characters - as they are immutable, it is safe to do this. Arrays also cannot be resized without allocating the space in a new array, so I'd again assume they grab only what they need.
Overhead of a .NET array?
Alternatives
Based on your information that there are only 20 language codes and performance is key, you could declare your own enum in order to reduce the size required to represent the codes:
enum LanguageCode : byte
{
    en = 0,   // one member per supported language code
}
This will take only 1 byte, as opposed to 4 for two chars in an array (plus the array overhead), but it does limit the range of available LanguageCode values to the range of byte - which is more than big enough for 20 items.
You can see the size of value types using the sizeof() operator: sizeof(LanguageCode) (note that for enum types sizeof requires an unsafe context). Enums are nothing but the underlying type under the hood; they default to int, but as you can see in the code sample you can change that by "inheriting" a new type.
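For illustration, a hypothetical round-trip between the wire form ("en") and the enum:

LanguageCode code = (LanguageCode)Enum.Parse(typeof(LanguageCode), "en");
string text = code.ToString();   // "en" again, for display or serialization
byte raw = (byte)code;           // 0: the single-byte storage form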
Short answer: Use string
Long answer:
private string languageCode;
AFAIK strings are stored as a length-prefixed array of chars. A String object is instantiated on the heap to maintain this raw array. But a String object is much more than a simple array: it enables basic string operations like comparison, concatenation, substring extraction, search, etc.
While
private char[] languageCode;
will be stored as an array of chars, i.e. an array object will be created on the heap and then used to manage your characters. But it still has a length attribute stored internally, so there are no apparent savings in memory compared to a string. Presumably an array is simpler than a string and may have fewer internal fields, thus offering a lower memory footprint (this needs to be verified).
But OTOH you lose the ability to perform string operations on this char array; even operations like string comparison become cumbersome. So, long story short: use a string!
How are these two stored in memory? How many bytes or bits will be allocated to them when values are assigned?
Every instance in .NET is stored as follows: one IntPtr-sized field for the type identifier; one more for locking on the instance; the remainder is instance field data rounded up to an IntPtr-sized amount. Hence, on a 32-bit platform every instance occupies 8 bytes + field data.
This applies to both a string and a char[]. Both of these also store the length of the data as an IntPtr-sized integer, followed by the actual data. Thus, a two-character string and a two-character char[], on a 32-bit platform, will occupy 8+4+4 = 16 bytes.
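A rough way to sanity-check those numbers empirically (a sketch; the exact figure depends on runtime version, bitness and padding):

const int N = 1000000;
var items = new string[N];
long before = GC.GetTotalMemory(true);
for (int i = 0; i < N; i++)
    items[i] = new string('x', 2);        // N distinct 2-char strings
long after = GC.GetTotalMemory(true);
Console.WriteLine((after - before) / N);  // bytes per string, plus one 4- or 8-byte array slot each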
The only way to reduce this when storing exactly two characters is to store the actual characters, or a struct containing the characters, in a field or an array. All of these would consume only 4 bytes for the characters:
// Option 1
class MyClass
{
    char Char1, Char2;
}
// Option 2
class MyClass
{
    CharStruct chars;
}
...
struct CharStruct { public char Char1; public char Char2; }
MyClass will end up using 8 bytes (on a 32-bit machine) per instance plus the 4 bytes for the chars.
// Option 3
class MyClass
{
    CharStruct[] chars;
}
This will use 8 bytes for the MyClass overhead, plus 4 bytes for the chars reference, plus 12 bytes for the array's overhead, plus 4 bytes per CharStruct in the array.
If you want to store exactly 2 chars, and do it most efficiently, use a struct:
struct Char2
{
    public char C1, C2;
}
Using this struct will generally not cause new heap allocations. It will just upsize an existing object (by the minimum possible amount) or consume stack space which is very cheap.
Strings indeed have a size overhead of one pointer length, i.e. 4 bytes for a 32 bit process, 8 bytes for a 64 bit process. But then again, strings offer so much more in return than char arrays.
If your application uses many short strings and you don't need to use their string properties and methods that often, you could probably save a few bytes of memory. But if you want to use any of them as a string, you will first have to create a new string instance. I can't see how this will help you save enough memory to be worth the trouble.
String implements an indexer of type char internally, and you could say a string is roughly equivalent to a char[] with lots of extra code to make it useful for you; hence, like an array, it is always stored on the heap.
An array cannot be resized without allocating new space; the same applies to a string, which is why it is immutable.
String implements IEnumerable<char>.
Notable point: when you pass a string to a function, the reference is passed by value unless you use ref.
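A quick sketch of that last point:

static void Reassign(string s) { s = "changed"; }
static void ReassignRef(ref string s) { s = "changed"; }

string value = "original";
Reassign(value);
Console.WriteLine(value);   // "original": only a copy of the reference was reassigned
ReassignRef(ref value);
Console.WriteLine(value);   // "changed": ref exposes the caller's variable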

C# big-endian UCS-2

The project I'm currently working on needs to interface with a client system that we don't make, so we have no control over how data is sent either way. The problem is that we're working in C#, which doesn't seem to have any support for UCS-2 and very little support for big-endian (as far as I can tell).
What I would like to know is whether there's anything I overlooked in .NET, or something someone else has made and released that we can use. If not, I will take a crack at encoding/decoding it in a custom method, if that's even possible.
But thanks for your time either way.
EDIT:
BigEndianUnicode does work to correctly decode the string; the problem was in receiving other data as big-endian. So far, using IPAddress.HostToNetworkOrder() as suggested elsewhere has allowed me to decode half of the string ("Merli?" is what comes up, and it should be "Merlin33069").
I'm combing through the code to see if there's another length variable I missed.
RESOLUTION:
After working out that the big-endian variables were the main problem, I went back through and reviewed the details. It seems the lengths of the strings were sent as character counts, not byte counts (in UTF-16 a char is two bytes); all I needed to do was double the length, and it worked out. Thank you all for your help.
string x = "abc";
byte[] data = Encoding.BigEndianUnicode.GetBytes(x);
In other direction:
string decodedX = Encoding.BigEndianUnicode.GetString(data);
It is not exactly UCS-2 but it is enough for most cases.
UPDATE: From the Unicode FAQ:
Q: What is the difference between UCS-2 and UTF-16?
A: UCS-2 is obsolete terminology which refers to a Unicode implementation up to Unicode 1.1, before surrogate code points and UTF-16 were added to Version 2.0 of the standard. This term should now be avoided.
UCS-2 does not define a distinct data format, because UTF-16 and UCS-2 are identical for purposes of data exchange. Both are 16-bit, and have exactly the same code unit representation.
Sometimes in the past an implementation has been labeled "UCS-2" to indicate that it does not support supplementary characters and doesn't interpret pairs of surrogate code points as characters. Such an implementation would not handle processing of character properties, code point boundaries, collation, etc. for supplementary characters.
EDIT: Now we know that the problem isn't in the encoding of the text data but in the encoding of the length. There are a few options:
Reverse the bytes and then use the built-in BitConverter code (which I assume is what you're using now; that or BinaryReader)
Perform the conversion yourself using repeated "add and shift" operations
Use my EndianBitConverter or EndianBinaryReader classes from MiscUtil, which are like BitConverter and BinaryReader, but let you specify the endianness.
You may be looking for Encoding.BigEndianUnicode. That's the big-endian UTF-16 encoding, which isn't strictly speaking the same as UCS-2 (as pointed out by Marc) but should be fine unless you give it strings including characters outside the BMP (i.e. above U+FFFF), which can't be represented in UCS-2 but are represented in UTF-16.
From the Wikipedia page:
The older UCS-2 (2-byte Universal Character Set) is a similar character encoding that was superseded by UTF-16 in version 2.0 of the Unicode standard in July 1996. It produces a fixed-length format by simply using the code point as the 16-bit code unit and produces exactly the same result as UTF-16 for 96.9% of all the code points in the range 0-0xFFFF, including all characters that had been assigned a value at that time.
I find it highly unlikely that the client system is sending you characters where there's a difference (which is basically the surrogate pairs, which are permanently reserved for that use anyway).
UCS-2 is so close to UTF-16 that Encoding.BigEndianUnicode will almost always suffice.
The issue (see comments) around reading the length prefix as big-endian is more correctly resolved via shift operations, which will do the right thing on all systems. For example:
Read4BytesIntoBuffer(buffer);
int len = (buffer[0] << 24) | (buffer[1] << 16) | (buffer[2] << 8) | buffer[3];
This will then work the same (at parsing a big-endian 4 byte int) on any system, regardless of local endianness.
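Combined with the resolution above (the prefix counts characters, not bytes), reading one string might look like this sketch, where stream and the ReadFully helper (a loop that reads until the buffer is full) are assumptions:

byte[] lenBuf = new byte[4];
ReadFully(stream, lenBuf);
int charCount = (lenBuf[0] << 24) | (lenBuf[1] << 16) | (lenBuf[2] << 8) | lenBuf[3];
byte[] data = new byte[charCount * 2];    // UTF-16: two bytes per code unit
ReadFully(stream, data);
string text = Encoding.BigEndianUnicode.GetString(data);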

Are C# Strings (and other .NET API's) limited to 2GB in size?

Today I noticed that C#'s String class returns the length of a string as an int. Since an int is always 32 bits, no matter what the architecture, does this mean that a string can only be 2GB or less in length?
A 2GB string would be very unusual, and present many problems along with it. However, most .NET APIs seem to use int to convey values such as length and count. Does this mean we are forever limited to collection sizes that fit in 32 bits?
Seems like a fundamental problem with the .NET APIs. I would have expected things like count and length to be returned via the equivalent of size_t.
Seems like a fundamental problem with the .NET APIs...
I don't know if I'd go that far.
Consider almost any collection class in .NET. Chances are it has a Count property that returns an int. So this suggests the class is bounded at a size of int.MaxValue (2147483647). That's not really a problem; it's a limitation -- and a perfectly reasonable one, in the vast majority of scenarios.
Anyway, what would the alternative be? There's uint -- but that's not CLS-compliant. Then there's long...
What if Length returned a long?
An additional 32 bits of memory would be required anywhere you wanted to know the length of a string.
The benefit would be: we could have strings taking up billions of gigabytes of RAM. Hooray.
Try to imagine the mind-boggling cost of some code like this:
// Lord knows how many characters
string ulysses = GetUlyssesText();
// allocate an entirely new string of roughly equivalent size
string schmulysses = ulysses.Replace("Ulysses", "Schmulysses");
Basically, if you're thinking of string as a data structure meant to store an unlimited quantity of text, you've got unrealistic expectations. When it comes to objects of this size, it becomes questionable whether you have any need to hold them in memory at all (as opposed to hard disk).
Correct; the maximum length would be the size of Int32. However, you'll likely run into other memory issues if you're dealing with strings that large anyway.
At some value of String.Length (probably around 5 MB) it's no longer really practical to use String. String is optimised for short bits of text.
Think about what happens when you do
myString += " more chars";
Something like:
System calculates length of myString plus length of " more chars"
System allocates that amount of memory
System copies myString to new memory location
System copies " more chars" to new memory location after last copied myString char
The original myString is left to the mercy of the garbage collector.
While this is nice and neat for small bits of text, it's a nightmare for large strings; just finding 2GB of contiguous memory is probably a showstopper.
So, if you know you are handling more than a very few MB of characters, use one of the buffer classes (e.g. StringBuilder).
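For example, a sketch with StringBuilder, which appends into an internal buffer and only builds the final string once:

var sb = new StringBuilder();
for (int i = 0; i < 100000; i++)
    sb.Append(" more chars");   // amortized growth, no full copy per call
string result = sb.ToString();  // one final allocation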
It's pretty unlikely that you'll need to store more than two billion objects in a single collection. You're going to incur some pretty serious performance penalties when doing enumerations and lookups, which are the two primary purposes of collections. If you're dealing with a data set that large, there is almost assuredly some other route you can take, such as splitting up your single collection into many smaller collections that contain portions of the entire set of data you're working with.
Heeeey, wait a sec.... we already have this concept -- it's called a dictionary!
If you need to store, say, 5 billion English strings, use this type:
Dictionary<string, List<string>> bigStringContainer;
Let's make the key string represent, say, the first two characters of the string. Then write an extension method like this:
public static string BigStringIndex(this string s)
{
    return String.Concat(s[0], s[1]);
}
and then add items to bigStringContainer like this:
bigStringContainer[item.BigStringIndex()].Add(item);
and call it a day. (There are obviously more efficient ways you could do that, but this is just an example)
Oh, and if you really, really do need to be able to look up any arbitrary object by absolute index, use an Array instead of a collection. Okay, yeah, you lose some type safety, but you can index array elements with a long.
The fact that the framework uses Int32 for Count/Length properties, indexers, etc. is a bit of a red herring. The real problem is that the CLR currently has a max object size restriction of 2GB.
So a string -- or any other single object -- can never be larger than 2GB.
Changing the Length property of the string type to return long, ulong or even BigInteger would be pointless, since you could never have more than approx 2^30 characters anyway (2GB max size and 2 bytes per character).
Similarly, because of the 2GB limit, the only arrays that could even approach having 2^31 elements would be bool[] or byte[] arrays that only use 1 byte per element.
Of course, there's nothing to stop you creating your own composite types to workaround the 2GB restriction.
(Note that the above observations apply to Microsoft's current implementation, and could very well change in future releases. I'm not sure whether Mono has similar limits.)
In versions of .NET prior to 4.5, the maximum object size is 2GB. From 4.5 onwards you can allocate larger objects if gcAllowVeryLargeObjects is enabled. Note that the limit for string is not affected, but "arrays" should cover "lists" too, since lists are backed by arrays.
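For reference, the switch lives in the application's config file:

<configuration>
  <runtime>
    <gcAllowVeryLargeObjects enabled="true" />
  </runtime>
</configuration>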
Even in x64 versions of Windows I got hit by .NET limiting each object to 2GB.
2GB is pretty small for a medical image. 2GB is even small for a Visual Studio download image.
If you are working with a file that is 2GB, that means you're likely going to be using a lot of RAM, and you're seeing very slow performance.
Instead, for very large files, consider using a MemoryMappedFile (see: http://msdn.microsoft.com/en-us/library/system.io.memorymappedfiles.memorymappedfile.aspx). Using this method, you can work with a file of nearly unlimited size, without having to load the whole thing in memory.

C#: String -> MD5 -> Hex

In languages like PHP or Python there are convenient functions to turn an input string into an output string that is the hexed representation of it.
I find it a very common and useful task (password storing and checking, checksums of file content, etc.), but in .NET, as far as I know, you can only work on byte streams.
A function to do the work is easy to put together (e.g. http://blog.stevex.net/index.php/c-code-snippet-creating-an-md5-hash-string/), but I'd like to know if I'm missing something, using the wrong pattern, or there is simply no such thing in .NET.
Thanks
The method you linked to seems right; a slightly different method is shown in the MSDN C# FAQ.
A comment suggests you can use:
System.Web.Security.FormsAuthentication.HashPasswordForStoringInConfigFile(string, "MD5");
Yes, you can only work with bytes (as far as I know). But you can turn those bytes easily into their hex representation by looping through them and doing something like:
myByte.ToString("x2");
And you can get the bytes that make up the string using:
System.Text.Encoding.UTF8.GetBytes(myString);
So it could be done in a couple lines.
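Putting those two pieces together, a minimal sketch (the helper name Md5Hex is mine):

static string Md5Hex(string input)
{
    using (MD5 md5 = MD5.Create())
    {
        byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(input));
        StringBuilder sb = new StringBuilder(hash.Length * 2);
        foreach (byte b in hash)
            sb.Append(b.ToString("x2"));   // two lowercase hex digits per byte
        return sb.ToString();
    }
}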
One problem is with the very concept of "the HEXed representation of [a string]".
A string is a sequence of characters. How those characters are represented as individual bits depends on the encoding. The "native" encoding to .NET is UTF-16, but usually a more compact representation is achieved (while preserving the ability to encode any string) using UTF-8.
You can use Encoding.GetBytes to get the encoded version of a string once you've chosen an appropriate encoding - but the fact that there is that choice to make is the reason that there aren't many APIs which go straight from string to base64/hex or which perform encryption/hashing directly on strings. Any such APIs which do exist will almost certainly be doing the "encode to a byte array, perform appropriate binary operation, decode opaque binary data to hex/base64".
(That makes me wonder whether it wouldn't be worth writing a utility class which could take an encoding, a Func<byte[], byte[]> and an output format such as hex/base64 - that could represent an arbitrary binary operation applied to a string.)
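(A rough sketch of that idea, with names of my own invention:)

static string ApplyToString(string input, Encoding encoding, Func<byte[], byte[]> operation)
{
    byte[] raw = encoding.GetBytes(input);      // explicit choice of encoding
    byte[] result = operation(raw);             // arbitrary binary transform
    return BitConverter.ToString(result).Replace("-", ""); // hex; base64 would use Convert.ToBase64String
}
// e.g. ApplyToString("hello", Encoding.UTF8, data => MD5.Create().ComputeHash(data));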

Create SecureString from unmanaged unicode string

I want to tie the CryptUnprotectData Windows API function and the .NET SecureString together in the best way possible. CryptUnprotectData returns a DATA_BLOB structure consisting of an array of bytes and a byte length. In my program this will be a Unicode UTF-16 string. SecureString has a constructor which takes char* and length parameters, so I would like to be able to do something like:
SecureString ss = new SecureString((char*)textBlob.pbData, textBlob.cbData / 2);
This works, except UTF-16 is variable-length, so I don't really know what to use as the length argument. The above example assumes 2-byte characters (BMP), but for other planes it could be up to 4 bytes. I need to know the number of UTF-16 characters in the byte array. What is the best way to do this without copying the values around in memory (thereby compromising security)? I plan on zeroing out and freeing the byte array as quickly as possible.
Most of the Windows API deals with UTF-16 code units as far as I'm aware - in other words, you treat surrogate pairs as two code units instead of a single character. Given that the constructor for SecureString is dealing with a pointer to .NET System.Char values (which are UTF-16 code units), I think the code snippet you've got is fine - the number of elements in pbData is half its size in bytes.
For instance, if pbData contained (just) a surrogate pair, cbData would be 4 and you'd still want to pass in 2 as the second argument - because that's the number of System.Char values you're constructing the SecureString from. The fact that it's one non-BMP Unicode character is irrelevant to the number of UTF-16 System.Char values it's represented in.
(And yes, the support for non-BMP data is a bit of a mess, and I suspect very few people get it right everywhere. I'm sure I don't. Fortunately in many places you don't need to worry...)
