Create SecureString from unmanaged Unicode string - C#

I want to tie the CryptUnprotectData Windows API function and the .NET SecureString together in the best way possible. CryptUnprotectData returns a DATA_BLOB structure consisting of an array of bytes and a byte length. In my program this will be a Unicode UTF-16 string. SecureString has a constructor which takes char* and length parameters, so I would like to be able to do something like:
SecureString ss = new SecureString((char*)textBlob.pbData, textBlob.cbData / 2);
This works, except UTF-16 is variable length, so I don't really know what to use as the length argument. The above example assumes 2-byte characters (BMP), but for other planes a character could take up to 4 bytes. I need to know the number of UTF-16 characters in the byte array. What is the best way to do this without copying the values around in memory (and thereby compromising security)? I plan on zeroing out and freeing the byte array as quickly as possible.

Most of the Windows API deals in UTF-16 code units as far as I'm aware - in other words, you treat a surrogate pair as two code units rather than a single character. Given that the constructor for SecureString is dealing with a pointer to .NET System.Char values (which are UTF-16 code units), I think the code snippet you've got is fine - the number of elements in pbData is half its size in bytes.
For instance, if pbData contained (just) a surrogate pair, cbData would be 4 and you'd still want to pass in 2 as the second argument - because that's the number of System.Char values you're constructing the SecureString from. The fact that it's one non-BMP Unicode character is irrelevant to the number of UTF-16 System.Char values used to represent it.
(And yes, the support for non-BMP data is a bit of a mess, and I suspect very few people get it right everywhere. I'm sure I don't. Fortunately in many places you don't need to worry...)
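For reference, here's a minimal sketch of the whole round trip, assuming the simplified DATA_BLOB declaration below and an /unsafe build; the SecureStringInterop class and FromBlob helper are made-up names for illustration:
using System;
using System.Runtime.InteropServices;
using System.Security;

[StructLayout(LayoutKind.Sequential)]
struct DATA_BLOB
{
    public int cbData;      // size of the unmanaged buffer in bytes
    public IntPtr pbData;   // pointer to the unmanaged buffer
}

static class SecureStringInterop
{
    // Builds a SecureString straight from the unmanaged buffer, then zeroes it.
    public static unsafe SecureString FromBlob(DATA_BLOB textBlob)
    {
        int charCount = textBlob.cbData / 2;        // UTF-16 code units, not "characters"
        var ss = new SecureString((char*)textBlob.pbData, charCount);
        ss.MakeReadOnly();

        // Wipe the plaintext copy as soon as the SecureString owns the data.
        for (int i = 0; i < textBlob.cbData; i++)
            Marshal.WriteByte(textBlob.pbData, i, 0);

        // CryptUnprotectData allocates pDataOut with LocalAlloc, so the caller
        // should still release it (e.g. via a LocalFree P/Invoke).
        return ss;
    }
}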

Related

High-bit flags of the string length

There is an old article talking about some string internals in .NET/C#. One of the interesting tidbits:
m_stringLength
This is the logical length of the string, the one returned by String.Length.
Because a number of high bits are used for additional flags to enhance performance, the maximum length of the string is constrained to a limit much smaller than UInt32.Max on 32-bit systems. Some of these flags indicate the string contains simple characters such as plain ASCII and will not require invoking complex Unicode algorithms for sorting and comparison tests.
I know that BinaryReader reads strings as length-prefixed with a 7-bit-encoded integer; does that mean the extra space is used for the aforementioned string flag (0 - ASCII, 1 - wide)?
Is this relevant for Mono from version 2.0 onwards? I'm writing a simple custom wrapper around a string to make it mutable, and although that string isn't going to be used in sorting or comparisons (for now), I was wondering whether I should pre-emptively allocate the new string filled with ASCII or Unicode characters (i.e. if I know/assume the content) so the flag gets set by default.
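For what it's worth, the 7-bit prefix that BinaryWriter/BinaryReader use is just the byte count of the encoded string, with no ASCII/wide flag in it. A minimal sketch of decoding that prefix by hand (the demo class is purely illustrative):
using System;
using System.IO;

class SevenBitPrefixDemo
{
    static void Main()
    {
        var ms = new MemoryStream();
        new BinaryWriter(ms).Write("hello");   // writes 0x05, then the UTF-8 bytes

        // Decode the 7-bit-encoded length by hand: each byte holds 7 payload bits,
        // and a set high bit means another byte follows.
        ms.Position = 0;
        int length = 0, shift = 0, b;
        do
        {
            b = ms.ReadByte();
            length |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);

        Console.WriteLine(length);             // 5 - the byte count of the encoded string
    }
}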

C# What is the difference between Text.Encoder and Text.Encoding

I am currently working with Unicode as bytes, using the Encoding class to get bytes and get strings.
However, I saw there is an Encoder class, and it seems to do the same thing as the Encoding class. Does anyone know what the difference between them is, and when to use either of them?
Here are the Microsoft documentation page:
Encoder: https://msdn.microsoft.com/en-us/library/system.text.encoder(v=vs.110).aspx
Encoding: https://msdn.microsoft.com/en-us/library/system.text.encoding(v=vs.110).aspx
There is definitely a difference. An Encoding is an algorithm for transforming a sequence of characters into bytes and vice versa. An Encoder is a stateful object that transforms sequences of characters into bytes. To get an Encoder object you usually call GetEncoder on an Encoding object.
Why is it necessary to have a stateful transformation? Imagine you are trying to efficiently encode long sequences of characters. You want to avoid creating a lot of arrays or one huge array, so you break the characters down into, say, reusable 1K character buffers. However, this might produce some illegal character sequences, for example a UTF-16 surrogate pair broken across two separate calls to GetBytes. The Encoder object knows how to handle this and saves the necessary state across successive calls to GetBytes. Thus you use an Encoder when transforming text one block at a time, where a single block may not be self-contained. I believe you can reuse an Encoder instance across transforms of multiple sections of text as long as you have called GetBytes with flush set to true on the last array of characters.
If you just want to easily encode short strings, use the Encoding.GetBytes methods. For the decoding operations there is a similar Decoder class that holds the decoding state.
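Here's a small sketch of that in practice; the chunk size is chosen deliberately so a surrogate pair straddles two GetBytes calls, which calling Encoding.GetBytes on each chunk separately would mangle:
using System;
using System.IO;
using System.Text;

class EncoderChunkDemo
{
    static void Main()
    {
        // The emoji is a surrogate pair in UTF-16; a chunk size of 4 splits it.
        char[] chars = "abc\U0001F600def".ToCharArray();

        Encoder encoder = Encoding.UTF8.GetEncoder();
        var output = new MemoryStream();
        byte[] buffer = new byte[16];

        for (int i = 0; i < chars.Length; i += 4)
        {
            int count = Math.Min(4, chars.Length - i);
            bool flush = i + count >= chars.Length;          // only flush on the last chunk
            int bytes = encoder.GetBytes(chars, i, count, buffer, 0, flush);
            output.Write(buffer, 0, bytes);
        }

        // The split surrogate pair survives because the Encoder kept state between calls.
        Console.WriteLine(Encoding.UTF8.GetString(output.ToArray()));   // abc😀def
    }
}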

What is the most significant byte of 160 bit hash for arithmetic operations?

Could somebody help me to understand what is the most significant byte of a 160 bit (SHA-1) hash?
I have a C# code which calls the cryptography library to calculate a hash code from a data stream. In the result I get a 20 byte C# array. Then I calculate another hash code from another data stream and then I need to place the hash codes in ascending order.
Now I'm trying to understand how to compare them correctly. Apparently I need to subtract one from the other and then check whether the result is negative, positive or zero. Technically, I have two 20-byte arrays. Looked at from the memory perspective, the least significant byte is at the beginning (lower memory address) and the most significant byte is at the end (higher memory address). On the other hand, from the human-reading perspective the most significant byte is at the beginning and the least significant is at the end, and if I'm not mistaken this order is used for comparing GUIDs. Of course, the two approaches will give different orders. Which way is considered the right or conventional one for comparing hash codes? It is especially important in our case because we are thinking about implementing a distributed hash table which should be compatible with existing ones.
You should think of the initial hash as just bytes, not a number. If you're trying to order them for indexed lookup, use whatever ordering is simplest to implement - there's no general purpose "right" or "conventional" here, really.
If you've got some specific hash table you want to be "compatible" with (not even sure what that would mean) you should see what approach to ordering that hash table takes, assuming it's even relevant. If you've got multiple tables you need to be compatible with, you may find you need to use different ordering for different tables.
Given the comments, you're trying to work with Kademlia, which, based on this document, treats the hashes as big-endian numbers:
Kademlia follows Pastry in interpreting keys (including nodeIDs) as bigendian numbers. This means that the low order byte in the byte array representing the key is the most significant byte and so if two keys are close together then the low order bytes in the distance array will be zero.
That's just an arbitrary interpretation of the bytes - so long as everyone uses the same interpretation, it will work... but it would work just as well if everyone decided to interpret them as little-endian numbers.
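If you do settle on the big-endian interpretation, ordering the hashes boils down to a plain lexicographic byte comparison. A minimal sketch (the comparer name is just for illustration):
using System;
using System.Collections.Generic;

class BigEndianHashComparer : IComparer<byte[]>
{
    public int Compare(byte[] x, byte[] y)
    {
        if (x.Length != y.Length)
            throw new ArgumentException("Hashes must be the same length.");

        for (int i = 0; i < x.Length; i++)      // byte 0 is the most significant
        {
            int diff = x[i].CompareTo(y[i]);
            if (diff != 0)
                return diff;
        }
        return 0;                               // identical hashes
    }
}
Sorting a List<byte[]> of 20-byte hashes with this comparer, e.g. hashes.Sort(new BigEndianHashComparer()), then gives the ascending order described in the question.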
You can use SequenceEqual to compare byte arrays; check the following links for more details:
How to compare two arrays of bytes
Comparing two byte arrays in .NET
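In short, == on arrays only compares references, while SequenceEqual compares the elements:
using System;
using System.Linq;

class SequenceEqualDemo
{
    static void Main()
    {
        byte[] a = { 0xAB, 0xCD };
        byte[] b = { 0xAB, 0xCD };

        Console.WriteLine(a == b);              // False - reference comparison
        Console.WriteLine(a.SequenceEqual(b));  // True  - element-by-element comparison
    }
}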

How are String and Char types stored in memory in .NET?

I'd need to store a language code string, such as "en", which will always contain 2 characters.
Is it better to define the type as "String" or "Char"?
private string languageCode;
vs
private char[] languageCode;
Or is there another, better option?
How are these two stored in memory? How many bytes or bits will be allocated to them when values are assigned?
How They Are Stored
Both the string and the char[] are stored on the heap - so storage is the same. Internally I would assume a string simply is a cover for char[] with lots of extra code to make it useful for you.
Also if you have lots of repeating strings, you can make use of Interning to reduce the memory footprint of those strings.
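A small sketch of what interning buys you - two separately built strings with the same contents collapse onto one shared instance:
using System;

class InternDemo
{
    static void Main()
    {
        // Strings built at runtime normally get their own heap allocations...
        string a = new string(new[] { 'e', 'n' });
        string b = new string(new[] { 'e', 'n' });
        Console.WriteLine(ReferenceEquals(a, b));                                 // False

        // ...but interning returns one shared instance for equal contents.
        Console.WriteLine(ReferenceEquals(string.Intern(a), string.Intern(b)));   // True
    }
}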
The Better Option
I would favour string - it is immediately more apparent what the data type is and how you intend to use it. People are also more accustomed to using strings, so maintainability won't suffer. You will also benefit greatly from all the boilerplate code that has been done for you. Microsoft have also put a lot of effort into making sure the string type is not a performance hog.
The Allocation Size
I have no idea exactly how much is allocated; I believe strings are quite efficient in that they only allocate enough to store the Unicode characters - as they are immutable, it is safe to do this. Arrays also cannot be resized without allocating the space in a new array, so I'd again assume they grab only what they need.
Overhead of a .NET array?
Alternatives
Based on your information that there are only 20 language codes and performance is key, you could declare your own enum in order to reduce the size required to represent the codes:
enum LanguageCode : byte
{
    en = 0,
}
This will only take 1 byte as opposed to 4+ for two chars (in an array), but it does limit the range of available LanguageCode values to the range of byte - which is more than big enough for 20 items.
You can see the size of value types using the sizeof() operator: sizeof(LanguageCode). Enums are nothing but the underlying type under the hood; they default to int, but as you can see in my code sample you can change that by "inheriting" a new type.
Short answer: Use string
Long answer:
private string languageCode;
AFAIK strings are stored as a length-prefixed array of chars. A String object is instantiated on the heap to maintain this raw array, but a String object is much more than a simple array: it enables basic string operations like comparison, concatenation, substring extraction, search, etc.
While
private char[] languageCode;
will be stored as an array of chars, i.e. an Array object will be created on the heap and then used to manage your characters. But it still has a length attribute which is stored internally, so there are no apparent savings in memory when compared to a string. Though presumably an Array is simpler than a String and may have fewer internal variables, thus offering a lower memory footprint (this needs to be verified).
But OTOH you lose the ability to perform string operations on this char array. Even operations like string comparison become cumbersome. So, long story short, use a string!
How are these 2 stored in memory? how many bytes or bits for will be allocated to them when values assigned?
Every instance in .NET is stored as follows: one IntPtr-sized field for the type identifier; one more for locking on the instance; the remainder is instance field data rounded up to an IntPtr-sized amount. Hence, on a 32-bit platform every instance occupies 8 bytes + field data.
This applies to both a string and a char[]. Both of these also store the length of the data as an IntPtr-sized integer, followed by the actual data. Thus, a two-character string and a two-character char[], on a 32-bit platform, will occupy 8+4+4 = 16 bytes.
The only way to reduce this when storing exactly two characters is to store the actual characters, or a struct containing the characters, in a field or an array. All of these would consume only 4 bytes for the characters:
// Option 1
class MyClass
{
    char Char1, Char2;
}
// Option 2
class MyClass
{
    CharStruct chars;
}
...
struct CharStruct { public char Char1; public char Char2; }
MyClass will end up using 8 bytes (on a 32-bit machine) per instance plus the 4 bytes for the chars.
// Option 3
class MyClass
{
    CharStruct[] chars;
}
This will use 8 bytes for the MyClass overhead, plus 4 bytes for the chars reference, plus 12 bytes for the array's overhead, plus 4 bytes per CharStruct in the array.
If you want to store exactly 2 chars, and do it most efficiently, use a struct:
struct Char2
{
    public char C1, C2;
}
Using this struct will generally not cause new heap allocations. It will just upsize an existing object (by the minimum possible amount) or consume stack space which is very cheap.
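A minimal usage sketch for the Char2 struct above; the Localized class and its members are made up for illustration:
class Localized
{
    private Char2 languageCode;                 // 4 bytes stored inline, no extra object

    public void SetLanguage(string code)        // e.g. "en"
    {
        languageCode = new Char2 { C1 = code[0], C2 = code[1] };
    }

    public string GetLanguage()
    {
        return new string(new[] { languageCode.C1, languageCode.C2 });
    }
}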
Strings indeed have a size overhead of one pointer length, i.e. 4 bytes in a 32-bit process and 8 bytes in a 64-bit process. But then again, strings offer so much more in return than char arrays.
If your application uses many short strings and you don't need to use their string properties and methods that often, you could probably save a few bytes of memory. But if you want to use any of them as a string, you will first have to create a new string instance. I can't see how this will help you save enough memory to be worth the trouble.
String just implements an indexer of type char internally, and we can say that string is essentially equivalent to a char[] with lots of extra code to make it useful for you; hence, like an array, it is always stored on the heap.
A string cannot be manipulated without allocating new space, just like an array; hence it is immutable.
String implements IEnumerable<char>
Notable point: when you pass a string to a function, the reference is passed by value unless you use ref
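To be precise, the string reference itself is passed by value - reassigning the parameter inside a method doesn't affect the caller unless you use ref:
using System;

class PassDemo
{
    static void Reassign(string s) { s = "changed"; }
    static void ReassignRef(ref string s) { s = "changed"; }

    static void Main()
    {
        string code = "en";
        Reassign(code);
        Console.WriteLine(code);        // en - only the copy of the reference was reassigned
        ReassignRef(ref code);
        Console.WriteLine(code);        // changed
    }
}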

C# big-endian UCS-2

The project I'm currently working on needs to interface with a client system that we don't make, so we have no control over how data is sent either way. The problem is that we're working in C#, which doesn't seem to have any support for UCS-2 and very little support for big-endian (as far as I can tell).
What I would like to know is whether there's anything I overlooked in .NET, or something someone else has made and released that we can use. If not, I will take a crack at encoding/decoding it in a custom method, if that's even possible.
But thanks for your time either way.
EDIT:
BigEndianUnicode does work to correctly decode the string; the problem was in receiving other data as big-endian. So far, using IPAddress.HostToNetworkOrder() as suggested elsewhere has allowed me to decode half of the string ("Merli?" is what comes up, and it should be "Merlin33069").
I'm combing the (short) code to see if there's another length variable I missed.
RESOLUTION:
After working out that the big-endian variables were the main problem, I went back through and reviewed the details, and it turns out the length of the strings was sent in character counts, not byte counts (in UTF-16 a char is two bytes). All I needed to do was double it, and it worked out. Thank you all for your help.
string x = "abc";
byte[] data = Encoding.BigEndianUnicode.GetBytes(x);
In other direction:
string decodedX = Encoding.BigEndianUnicode.GetString(data);
It is not exactly UCS-2 but it is enough for most cases.
Update: from the Unicode FAQ:
Q: What is the difference between UCS-2 and UTF-16?
A: UCS-2 is obsolete terminology which refers to a Unicode implementation up to Unicode 1.1, before surrogate code points and UTF-16 were added to Version 2.0 of the standard. This term should now be avoided.
UCS-2 does not define a distinct data format, because UTF-16 and UCS-2 are identical for purposes of data exchange. Both are 16-bit, and have exactly the same code unit representation.
Sometimes in the past an implementation has been labeled "UCS-2" to indicate that it does not support supplementary characters and doesn't interpret pairs of surrogate code points as characters. Such an implementation would not handle processing of character properties, code point boundaries, collation, etc. for supplementary characters.
EDIT: Now we know that the problem isn't in the encoding of the text data but in the encoding of the length. There are a few options:
Reverse the bytes and then use the built-in BitConverter code (which I assume is what you're using now; that or BinaryReader) - see the sketch after this list
Perform the conversion yourself using repeated "add and shift" operations
Use my EndianBitConverter or EndianBinaryReader classes from MiscUtil, which are like BitConverter and BinaryReader, but let you specify the endianness.
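A minimal sketch of the first option; the helper name and the Stream parameter are assumptions, not from the original answer:
using System;
using System.IO;

static class BigEndianLength
{
    // Option 1 in practice: read the four prefix bytes, then reverse them so that
    // BitConverter (little-endian on most machines) sees them in native order.
    public static int ReadBigEndianInt32(Stream stream)
    {
        byte[] buffer = new byte[4];
        int read = 0;
        while (read < 4)
        {
            int n = stream.Read(buffer, read, 4 - read);
            if (n == 0) throw new EndOfStreamException();
            read += n;
        }

        if (BitConverter.IsLittleEndian)
            Array.Reverse(buffer);              // undo the big-endian wire order

        return BitConverter.ToInt32(buffer, 0);
    }
}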
You may be looking for Encoding.BigEndianUnicode. That's the big-endian UTF-16 encoding, which isn't strictly speaking the same as UCS-2 (as pointed out by Marc) but should be fine unless you give it strings including characters outside the BMP (i.e. above U+FFFF), which can't be represented in UCS-2 but are represented in UTF-16.
From the Wikipedia page:
The older UCS-2 (2-byte Universal Character Set) is a similar character encoding that was superseded by UTF-16 in version 2.0 of the Unicode standard in July 1996. It produces a fixed-length format by simply using the code point as the 16-bit code unit and produces exactly the same result as UTF-16 for 96.9% of all the code points in the range 0-0xFFFF, including all characters that had been assigned a value at that time.
I find it highly unlikely that the client system is sending you characters where there's a difference (which is basically the surrogate pairs, which are permanently reserved for that use anyway).
UCS-2 is so close to UTF-16 that Encoding.BigEndianUnicode will almost always suffice.
The issue (comments) around reading the length prefix (as big-endian) is more correctly resolved via shift operations, which will do the right thing on all systems. For example:
Read4BytesIntoBuffer(buffer);
int len = (buffer[0] << 24) | (buffer[1] << 16) | (buffer[2] << 8) | buffer[3];
This will then work the same (at parsing a big-endian 4 byte int) on any system, regardless of local endianness.
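And a hedged end-to-end sketch tying this together with the question's resolution - it assumes the prefix is a big-endian character count followed by big-endian UTF-16 text, with the original Read4BytesIntoBuffer replaced by an explicit ReadExactly helper:
using System;
using System.IO;
using System.Text;

static class BigEndianStringReader
{
    public static string ReadPrefixedString(Stream stream)
    {
        byte[] prefix = new byte[4];
        ReadExactly(stream, prefix);
        int charCount = (prefix[0] << 24) | (prefix[1] << 16) | (prefix[2] << 8) | prefix[3];

        byte[] payload = new byte[charCount * 2];   // two bytes per UTF-16 code unit
        ReadExactly(stream, payload);
        return Encoding.BigEndianUnicode.GetString(payload);
    }

    static void ReadExactly(Stream stream, byte[] buffer)
    {
        int read = 0;
        while (read < buffer.Length)
        {
            int n = stream.Read(buffer, read, buffer.Length - read);
            if (n == 0) throw new EndOfStreamException();
            read += n;
        }
    }
}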
