How to (de)construct data frames in WebSockets hybi 08+? - c#

Since Chrome updated to v14, they went from version three of the draft to version eight of the draft.
I have an internal chat application running on WebSocket, and although I've gotten the new handshake working, the data framing apparently has changed as well. My WebSocket server is based on Nugget.
Does anybody have WebSocket working with version eight of the draft and have an example on how to frame the data being sent over the wire?

(See also: How can I send and receive WebSocket messages on the server side?)
It's fairly easy, but it's important to understand the format.
The first byte is almost always 1000 0001, where the 1 means "last frame", the three 0s are reserved bits without any meaning so far and the 0001 means that it's a text frame (which Chrome sends with the ws.send() method).
(Update: Chrome can now also send binary frames with an ArrayBuffer. The last four bits of the first byte will be 0002, so you can differ between text and binary data. The decoding of the data works exactly the same way.)
The second byte contains of a 1 (meaning that it's "masked" (encoded)) followed by seven bits which represent the frame size. If it's between 000 0000 and 111 1101, that's the size. If it's 111 1110, the following 2 bytes are the length (because it wouldn't fit in seven bits), and if it's 111 1111, the following 8 bytes are the length (if it wouldn't fit in two bytes either).
Following that are four bytes which are the "masks" which you need to decode the frame data. This is done using xor encoding which uses one of the masks as defined by indexOfByteInData mod 4 of the data. Decoding simply works like encodedByte xor maskByte (where maskByte is indexOfByteInData mod 4).
Now I must say I'm not experienced with C# at all, but this is some pseudocode (some JavaScript accent I'm afraid):
var length_code = bytes[1] & 127, // remove the first 1 by doing '& 127'
masks,
data;
if(length_code === 126) {
masks = bytes.slice(4, 8); // 'slice' returns part of the byte array
data = bytes.slice(8); // and accepts 'start' (inclusively)
} else if(length_code === 127) { // and 'end' (exclusively) as arguments
masks = bytes.slice(10, 14); // Passing no 'end' makes 'end' the length
data = bytes.slice(14); // of the array
} else {
masks = bytes.slice(2, 6);
data = bytes.slice(6);
}
// 'map' replaces each element in the array as per a specified function
// (each element will be replaced with what is returned by the function)
// The passed function accepts the value and index of the element as its
// arguments
var decoded = data.map(function(byte, index) { // index === 0 for the first byte
return byte ^ masks[ index % 4 ]; // of 'data', not of 'bytes'
// xor mod
});
You can also download the specification which can be helpful (it of course contains everything you need to understand the format).

This c# code works fine for me. Decode text data that comes from a browser to a c# server via socket.
public static string GetDecodedData(byte[] buffer, int length)
{
byte b = buffer[1];
int dataLength = 0;
int totalLength = 0;
int keyIndex = 0;
if (b - 128 <= 125)
{
dataLength = b - 128;
keyIndex = 2;
totalLength = dataLength + 6;
}
if (b - 128 == 126)
{
dataLength = BitConverter.ToInt16(new byte[] { buffer[3], buffer[2] }, 0);
keyIndex = 4;
totalLength = dataLength + 8;
}
if (b - 128 == 127)
{
dataLength = (int)BitConverter.ToInt64(new byte[] { buffer[9], buffer[8], buffer[7], buffer[6], buffer[5], buffer[4], buffer[3], buffer[2] }, 0);
keyIndex = 10;
totalLength = dataLength + 14;
}
if (totalLength > length)
throw new Exception("The buffer length is small than the data length");
byte[] key = new byte[] { buffer[keyIndex], buffer[keyIndex + 1], buffer[keyIndex + 2], buffer[keyIndex + 3] };
int dataIndex = keyIndex + 4;
int count = 0;
for (int i = dataIndex; i < totalLength; i++)
{
buffer[i] = (byte)(buffer[i] ^ key[count % 4]);
count++;
}
return Encoding.ASCII.GetString(buffer, dataIndex, dataLength);
}

To be more accurate, Chrome has gone from the Hixie-76 version of the protocol to the HyBi-10 version of the protocol. HyBi-08 through HyBi-10 all report as version 8 because it was really only the specification text that changed and not the wire format.
The framing has changed from using '\x00...\xff' to using a 2-7 byte header for each frame that contains the length of the payload among other things. There is a diagram of the frame format in section 4.2 of the specification. Also note that data from the client (browser) to the server is masked (4 bytes of the client-server frame headers contain the unmasking key).
You can look at websockify which is a WebSockets to TCP socket proxy/bridge that I created to support noVNC. It is implemented in python but you should be able to get the idea from the encode_hybi and decode_hybi routines.

Related

Better Efficiency for assembling integers from bytes

My application receives data from a serial port, which is send in packets. The packets are defined as following
1 byte - Identifier
2 bytes - lenght of data
n bytes - data
1 bytes - Checksum
For example if the length is specified as 508 there will be 508 bytes, which would be 127 uint32_t values.
Currently I use the following code to assemble the uint32_t values from the data that is sent in bytes:
private UInt32[] number_array = new UInt32[16384];
private void decodePacket(int startpos, byte[] data, int lenght)
{
/* Starting position */
int pos = startpos;
for(int i=0; i<lenght; i++)
{
/* Convert 4 bytes to one uint32_t value */
int value = data[i] | data[i + 1]<<8 | data[i + 2]<<16 | data[i + 3]<<24;
/* Write to array */
number_array[pos] = Convert.ToUInt32(value);
/* Advance i by 4 (bytes */
i += 4;
/* Advance pos */
pos++;
}
}
It does work fine, but I'm thinking it's very inefficient. There are usually 16384 uint32_t values to process, so this function is called a lot of times.
Is there a more efficient / faster way to do this?
Look at this simple code:
static void Main(string[] args)
{
byte[] data = new byte[]
{
1, 10, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 5
};
byte id = data[0];
byte[] len = new byte[4];
Array.Copy(data, 1, len, 0, 2);
int dataLen = BitConverter.ToInt32(len, 0);
byte[] dataRead = new byte[dataLen];
Array.Copy(data, 3, dataRead, 0, dataLen);
byte checksum = data[data.Length - 1];
Console.ReadKey();
}
Now, first you get identifier that is on the first pos.
Next, you get length of the data. You have to create 4 byte array to copy values from data array to this new array. It should be 4 bytes, because you would like to convert those bytes into Int32. You could have this array 2 bytes length and convert it to Int16, but Int32 should have better performance.
So, when you have length data in this len array, you can convert those values into Int32. But be careful of endianes. It may be different.
At the end you create an array that will contain all the REAL data. And then copy the real data to new array.
This solution should be faster than yours. But... Be careful of endianess.
Here is the best possible way using C#
private void decodePacket(int startpos, byte[] data, int lenght)
{
/* Starting position */
int pos = startpos;
for (int i = 0; i < lenght; i += 4)
{
/* Convert 4 bytes to one uint32_t value */
int value = BitConverter.ToInt32(data, i);
/* Write to array */
number_array[pos] = Convert.ToUInt32(value);
/* Advance pos */
pos++;
}
}
This is same code as in question but with two changes.
Index i was being incremented at two places which was resulting in increments of 5 instead of 4. Changed it to just one increment of 4.
Use of BitConveter instead of bitwise logic. Although it might not provide any significant performance boost. It is better to use BitConverter for platform independence.
UPDATE
In case your bytes are stored at 32-bit aligned memory addresses BitConverter provides you with maximum performance conversion. But in C# you cannot guarantee memory location alignment. In that case bit shifting is the only way.
In case of bit shifting BitConverter also uses same logic for little endian systems as shown in question. But, it can help keeping your code platform independent by using another bit shifting pattern for big endian systems.

Microsoft BigInteger goes "negative" when I import from an array

Suppose I have an array that ends with the most signifiant bit of the most significant byte equaling 1.
My understanding is that if this is the case, BigInteger will treat this as a negative number, by design.
BigInteger numberToShorten = new BigInteger(toEncode);
if (numberToShorten.Sign == -1)
{
// problem with twos compliment or the last bit of the last byte is equal to 1
throw new Exception("Unexpected negative number");
}
To solve this problem, I think I need to add a dummy zero bit to the array, prior to converting the array. I can easily do this using Array.Resize().
My question is, how should I test if the last bit is indeed equal to 1?
I'm pretty weak on my boolean logic now, and am thinking I need to AND the values and test for equality, but not able to get the syntax correct in C#. Something like this:
byte temp = toEncode[toEncode.Length - 1];
if (temp == ???)
{
Array.Resize(ref toEncode, toEncode.Length +1);
}
If the number to be converted to BigInteger is always positive then you don't actually need to test at all; just appending a zero byte will always work correctly.
For completeness, checking if the MSB of any one byte is set is done with
if (byte & 0x80 == 0x80) ...
Found the answer at the bottom of this MSDN article (I think)
ulong originalNumber = UInt64.MaxValue;
byte[] bytes = BitConverter.GetBytes(originalNumber);
if (originalNumber > 0 && (bytes[bytes.Length - 1] & 0x80) > 0)
{
byte[] temp = new byte[bytes.Length];
Array.Copy(bytes, temp, bytes.Length);
bytes = new byte[temp.Length + 1];
Array.Copy(temp, bytes, temp.Length);
}
BigInteger newNumber = new BigInteger(bytes);
Console.WriteLine("Converted the UInt64 value {0:N0} to {1:N0}.",
originalNumber, newNumber);
// The example displays the following output:
// Converted the UInt64 value 18,446,744,073,709,551,615 to 18,446,744,073,709,551,615.
I'm not sure if this is necessary, have you tried it? In any case, assuming that temp contains a byte and you are trying to test whether that byte ends in 1, this is how you should do it:
if((temp & 0x01)==1)
{
...
}
That amounts to:
Temp: ???????z
0x01: 00000001
Temp & 0x01: 0000000z
Where z represents the bit you are trying to test, and you can now just test whether the result is equal to 1 or not.
If you are trying to test whether the byte starts with 1, do:
if((temp & 0x80)==0x80) //0x80 is 10000000 in binary
{
...
}

Convert Hex string to Image

I am using FormsConnect an App for the iPad to export signatures to XML.
The output from the App is as follows:
0x1828023f020000001d1100000018000000e02f01000000000000000000000000a888223f010000005259414ca0a37e00000000000000000000000000000000000000000000000000000000000000000010b9611100000000000000000000000000000000308e660e00000000c021070030a46c0e00000000000000000000000044b0fb3e00beb0000100000001000000107b6111000030420000584200000000000000000000000000000000000080bf00000000000000000100000000000000000000000000000000000000000000001cb0fb3e0100000000000000000000000000000000000000000000000047864aa0860100a086010000c05411000070110047864a0000000000000000010000000000000000000000000000000000000018884d3f801200010900000101000000000000000000020074a54e3fb0b6690ec0b6690e00000000000000000000000044b0fb3e0038a9000100000001000000f0e4c008000030420000304200000000000000000000000000000000000080bf00000000000000000100000000000000000000000000000000000000000000001cb0fb3e0100000000000000000000000000000000000000000000000047864aa0860100a08601000020761100407c110047864a0000000000000000010000000000000000000000000000000000000018884d3f801200010900000101000000000000000000020074a54e3f20b6690e30b6690e000000000000000000000000a888223f010000005259414c808b730c00000000000000000000000000000000000000000000000000000000000000000cc0fb3e0000000000000000000000000400000000000000000000000000000000000000000000000000000000000000a888223f010000005259414ce06ebe0a00000000000000000000000000000000000000000000000000000000000000009ceefb3ea08c61110000000000000000000000000000000000000000000000000000000000000000280001c10400000000000000b017640000000000000000000000000000000000000000000000000000000000000020415846fc3ed08c6111000000000000000000000000000000000000000000000000000000000000003f000000004014630000887a0670c89b0c40146300000000000000000000000000000000000000000000000000100000800000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000e44200000000000000000000000000000000408d6111f08e61110000584200003042000000000000000000000000a888223f010000005259414c4040050d00000000000000000000000000000000000000000000000000000000000000002cd7fb3e108d61110000000000000000a08e6111d033120000000000000000000000000000000000200000c10000000000000000000000000000000000000000a888223f010000005259414c304d050d0000000000000000000000000000000000000000000000000000000000000000ccc8fb3eb08d611100000000d08c611100000000908b6111000000000000000000000000000000000000000000000000000000000150000000000000f08d611101000000000000000000e03f0000204100000000000000000000000000000000108e611104000000000000000000000084854d3f01000000110000000300000002000000e08d61110000000000000000f0c7fb3e908b61112952d53200000000d08d6111000000000000000000000000c08a4d3f8011000101000000000000000000000000000100c8c94e3f0000000094c9fb3e0100000000000000608e611100000000000000000000000000000000000000000000000000003442000000000000e83f000000606666d63f808e6111408d6111000000000000000000000000c08a4d3f8011000101000000000000000000000000000100c8c94e3f0000000084854d3f0000000001000000030000000100000000000000000000000000000084854d3f01000000110000000300000002000000c08e61110000000000000000408d6111000000000000000000000000d088223fe003fe07020000000800000002000000000000000000000000000000b4effb3e308f6111000000000000000000000000b0906111000000000000000000000000f0496300200000c100000000608f6111000080bf0200000000000000a888223f010000005259414cf04a050d000000000000000000000000000000000000000000000000000000000000000054a0fb3ed08f6111000000000000000000000000d0331200000000000000000000000000f0496300330000c100000000000000000000000020887a06201c630000000000e0a762001032d805000000000000803f0000000000000000010000000000c841000000000416000000000000a888223f010000005259414c9047050d0000000000000000000000000000000000000000000000000000000000000000d088223fc00bfe070100000008000000010000000000000000000000000000003ca9fb3e00000000b7b6363f9796163ffffefe3e0000803f304277060000000084854d3f01000000110000000300000007000000609061110000000000000000c0b96f0e000000000000000000000000a8894d3f800501012412010050596100000000000000000000000000040000009f9e9e3ea9a8a83eb3b2b23e0000803f00c66d0e00000000000000000000000084854d3f01000000110000000300000002000000c06e6d0e00000000000000009ceefb3ee09161110000000000000000000000000000000000000000000000000000000000000000280001c10400000000000000b017640000000000000000000000000000000000000000000000000000000000000020415846fc3e10926111000000000000000000000000000000000000000000000000000000000000003f000000004014630020629b0c10629a0c40146300000000000000000000000000000000000000000000000000100000800000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000e4420000000000000000000000000000000080926111309461110000584200003042000000000000000000000000a888223f010000005259414c1043050d00000000000000000000000000000000000000000000000000000000000000002cd7fb3e509261110000000000000000e0936111d033120000000000000000000000000000000000200000c10000000000000000000000000000000000000000a888223f010000005259414c804b050d0000000000000000000000000000000000000000000000000000000000000000ccc8fb3ef0926111000000001092611100000000d09061110000000000000000000000000000000000000000000000000000000001500000000000003093611101000000000000000000e03f00002041000000000000000000000000000000005093611104000000000000000000000084854d3f01000000110000000300000002000000209361110000000000000000f0c7fb3ed09061112952d5320000000010936111000000000000000000000000c08a4d3f8011000101000000000000000000000000000100c8c94e3f0000000094c9fb3e0100000000000000a093611100000000000000000000000000000000000000000000000000003442000000000000e83f000000606666d63fc093611180926111000000000000000000000000c08a4d3f8011000101000000000000000000000000000100c8c94e3f0000000084854d3f0000000001000000030000000100000000000000000000000000000084854d3f0100000011000000030000000200000000946111000000000000000080926111000000000000000000000000d088223f6007fe07020000000800000002000000000000000000000000000000b4effb3e7094611100000000000000000000000030976111000000000000000000000000f0496300200000c100000000a0946111000080bf0200000000000000a888223f010000005259414c0047050d000000000000000000000000000000000000000000000000000000000000000054a0fb3e10956111000000000000000000000000d0331200000000000000000000000000f0496300330000c100000000000000000000000040629b0c201c630000000000e0a762001032d805000000000000803f0000000000000000010000000000c841000000000416000000000000a888223f010000005259414c8042050d0000000000000000000000000000000000000000000000000000000000000000d088223fc00ffe07010000000800000001000000000000000000000000000000907c6211000000000000000000000000a0946111000000000000000000000000a8894d3f8006010200000000504f6c0e00100000000000000000000000000000000000000000000045f41836397925360000000000000000000000005854554d000000006028000000000000000000000000000000000000d09561110000000000000000d4956111000000000010ea0d0000000000000000000000000000000000000000000000000000000000000000a8894d3f8005010132e100005059610000000000000000000000000004000000e3e2e23ef1f0f03e8180003f0000803fa888223f010000005259414c5045050d0000000000000000000000000000000000000000000000000000000000000000a888223f010000005259414cc044050d000000000000000000000000000000000000000000000000000000000000000054a0fb3e202fc508000000000000000000000000d0331200000000000000000000000000f0496300330000c10000000000000000000000008054990c201c630000000000e0a762001032d805000000000000803f0000000000000000010000000000c841000000000416000000000000a0966111000000000000000000000000087c070000000000000000000000000084854d3f010000001100000003000000020000007095611100000000000000009ceefb3e409661110000000000000000000000000000000000000000000000000000000000000000280001c10400000000000000b017640000000000000000000000000000000000000000000000000000000000000020415846fc3e60986111000000000000000000000000000000000000000000000000000000000000003f0000000040146300402a930c20a5990c40146300000000000000000000000000000000000000000000000000100000800000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000e44200000000000000000000000000000000a0986111509a611100005842000030420000000000000000000000002cd7fb3e709661110000000000000000009a6111d033120000000000000000000000000000000000200000c10000000000000000000000000000000000000000ccc8fb3e10996111000000006098611100000000509761110000000000000000000000000000000000000000000000000000000001500000000000005099611101000000000000000000e03f00002041000000000000000000000000000000007099611104000000000000000000000084854d3f01000000110000000300000002000000409961110000000000000000f0c7fb3e509761112952d5320000000030996111000000000000000000000000c08a4d3f8011000101000000000000000000000000000100c8c94e3f0000000094c9fb3e0100000000000000c099611100000000000000000000000000000000000000000000000000003442000000000000e83f000000606666d63fe0996111a0986111000000000000000000000000c08a4d3f8011000101000000000000000000000000000100c8c94e3f0000000084854d3f00000000010000000300000001000000000000000000000000
Now I'm trying to convert this Hex string back into an image using this code:
Bitmap newImage = new Bitmap(new MemoryStream(StringToByteArray(hex)));
public static byte[] StringToByteArray(string hex)
{
hex = hex.Substring(2, hex.Length - 2);
if (hex.Length % 2 != 0)
{
throw new ArgumentException(String.Format(CultureInfo.InvariantCulture, "The binary key cannot have an odd number of digits: {0}", hex));
}
byte[] HexAsBytes = new byte[hex.Length / 2];
for (int index = 0; index < HexAsBytes.Length; index++)
{
string byteValue = hex.Substring(index * 2, 2);
HexAsBytes[index] = byte.Parse(byteValue, NumberStyles.HexNumber, CultureInfo.InvariantCulture);
}
return HexAsBytes;
}
But I keep getting Parameter is not valid, when attempting to Load the stream into a Bitmap. I posed the question to the developer of the App, and here is his response:
Dear Alex
The data is stored in binary, hex format. Here are the instructions on how to read the binary data and write it back out as a jpg.
byteData[i] = (digit [i * 2] shl 4) + (digit [i * 2 + 1])
The above assumes that the digits are already converted to numbers from 0 to 15. If you’re trying to optimize your script by writing to memory in larger chunks (integers, for instance) you need to take endianness into account. All major CPU architectures are little-endian which means the numbers are stored with their bytes arranged in the opposite order from their actual value (ie, 0x1234 is stored as 0x3412 in memory).
Below are two different scripts for extracting blob data from the form:
1. Python Script:
iterate over each photo attached to image field in FormConnect xml file
data = R[i]
to strip '0x' off front of each blob
a = data[2:]
iterate over each byte, write as binary int to array
for i in range(len(a)/2):
b = (int(bin(int(a[i*2],16)),2)<<4) + int(bin(int(a[i*2+1],16)),2)
data.append(b)
write array to binary file
F = file(outFile,'wb')
data.tofile(F)
F.close()
PHP Script:
$pic = "";
$field = substr($field,2);
header("Content-Type: image/jpeg");
$pic = pack("H*", $field);
echo $pic;
If anyone can help me on this, I would be very grateful.
Kind regards,
Alex
Your problem isn't the hex conversion routine, it's that the original data isn't encoded properly. Based on all the zeroes within it, I'd say it's just a raw bitmap! My guess is that the author of the program that generated that data forgot to encode the bitmap into a specific file format before outputting it.

Efficent way to sample and convert arrays of binary data using C#

I have a byte array. It contains 24 bit signed integers stored lsb to msb. The array could hold up to 4mb of data. The integers will be converted to 32 bit signed integers to be used in the application. I would like to hear about possible strategies for conversion and sampling of this data.
One thing I need to do with the data is graph it. With sequential sampling, I am worried about loosing some of the important peaks and valleys in the data. I also want to do some calculations to determine the highest and lowest values.
Given what I need to do, are there any algorithms or ways of doing things that will help me achieve my goal quickly and efficiently?
If your input has to be 3 byte ints, then you can convert to 4 byte ints as follows:
byte[] input = new byte[] {1, 2, 3, 4, 5, 6, 7, 8, 9}; //sample data
byte[] buffer = new byte[4]; //4 byte buffer for conversion from 3-> 4 byte int
int[] output = new int[input.Length / 3];
for (int i = 0, j = 0; i < input.Length; i += 3, j++)
{
Buffer.BlockCopy(input, i, buffer, 0, 3);
int signed32 = BitConverter.ToInt32(buffer, 0);
output[j] = signed32;
}
Edit
Fixed block copy for little endian.
I would suggest you to convert the byte array to an int[]. That way, you can work with it easily and today's computers can work with 32-bit integers much better than if you had to work with bytes that represent 24-bit integers all the time.
You should use the regular sized ints.
Storage is cheap (especially if you only need ~4MB of data) and if you are going to convert them to int32's for manipulation it's better if they're in that format from the beginning.
If the conversion will actually produce another array of int32s then you've just doubled the memory footprint. If you convert individual elements you've just increased execution time.
Best use the native int size.
It might be easier to implement and for future developers to understand if you use the bytes directly (3 at a time).
// If you're reading from a file, you don't have to read the whole array.
// Just read a large chunk (like 3 * 1024) bytes at a time (so it's divisible by 3).
byte [] data = new []{1,2,3, 4,5,6, 7,8,9};
int [] values = new [data.Length /3];
int min = int.MaxValue;
int max = int.MaxValue;
for (int i = 0,j = 1; i < data.Length - 2; i += 3, j++)
{
byte b1 = data[i];
byte b2 = data[i+1];
byte b3 = data[i+2];
// Are we dealing with 2's compliment or a sign bit? Let's assume sign bit.
int sign = b3 >> 7 == 1 ? -1 : 1;
int value = sign * ((int) (b3 <1)>1)<<16 + b2 << 8 + b1;
values[j] = value;
max = max > value ? max : value;
min = min < value ? min : value;
}

Best way to shorten UTF8 string based on byte length

A recent project called for importing data into an Oracle database. The program that will do this is a C# .Net 3.5 app and I'm using the Oracle.DataAccess connection library to handle the actual inserting.
I ran into a problem where I'd receive this error message when inserting a particular field:
ORA-12899 Value too large for column X
I used Field.Substring(0, MaxLength); but still got the error (though not for every record).
Finally I saw what should have been obvious, my string was in ANSI and the field was UTF8. Its length is defined in bytes, not characters.
This gets me to my question. What is the best way to trim my string to fix the MaxLength?
My substring code works by character length. Is there simple C# function that can trim a UT8 string intelligently by byte length (ie not hack off half a character) ?
I think we can do better than naively counting the total length of a string with each addition. LINQ is cool, but it can accidentally encourage inefficient code. What if I wanted the first 80,000 bytes of a giant UTF string? That's a lot of unnecessary counting. "I've got 1 byte. Now I've got 2. Now I've got 13... Now I have 52,384..."
That's silly. Most of the time, at least in l'anglais, we can cut exactly on that nth byte. Even in another language, we're less than 6 bytes away from a good cutting point.
So I'm going to start from #Oren's suggestion, which is to key off of the leading bit of a UTF8 char value. Let's start by cutting right at the n+1th byte, and use Oren's trick to figure out if we need to cut a few bytes earlier.
Three possibilities
If the first byte after the cut has a 0 in the leading bit, I know I'm cutting precisely before a single byte (conventional ASCII) character, and can cut cleanly.
If I have a 11 following the cut, the next byte after the cut is the start of a multi-byte character, so that's a good place to cut too!
If I have a 10, however, I know I'm in the middle of a multi-byte character, and need to go back to check to see where it really starts.
That is, though I want to cut the string after the nth byte, if that n+1th byte comes in the middle of a multi-byte character, cutting would create an invalid UTF8 value. I need to back up until I get to one that starts with 11 and cut just before it.
Code
Notes: I'm using stuff like Convert.ToByte("11000000", 2) so that it's easy to tell what bits I'm masking (a little more about bit masking here). In a nutshell, I'm &ing to return what's in the byte's first two bits and bringing back 0s for the rest. Then I check the XX from XX000000 to see if it's 10 or 11, where appropriate.
I found out today that C# 6.0 might actually support binary representations, which is cool, but we'll keep using this kludge for now to illustrate what's going on.
The PadLeft is just because I'm overly OCD about output to the Console.
So here's a function that'll cut you down to a string that's n bytes long or the greatest number less than n that's ends with a "complete" UTF8 character.
public static string CutToUTF8Length(string str, int byteLength)
{
byte[] byteArray = Encoding.UTF8.GetBytes(str);
string returnValue = string.Empty;
if (byteArray.Length > byteLength)
{
int bytePointer = byteLength;
// Check high bit to see if we're [potentially] in the middle of a multi-byte char
if (bytePointer >= 0
&& (byteArray[bytePointer] & Convert.ToByte("10000000", 2)) > 0)
{
// If so, keep walking back until we have a byte starting with `11`,
// which means the first byte of a multi-byte UTF8 character.
while (bytePointer >= 0
&& Convert.ToByte("11000000", 2) != (byteArray[bytePointer] & Convert.ToByte("11000000", 2)))
{
bytePointer--;
}
}
// See if we had 1s in the high bit all the way back. If so, we're toast. Return empty string.
if (0 != bytePointer)
{
returnValue = Encoding.UTF8.GetString(byteArray, 0, bytePointer); // hat tip to #NealEhardt! Well played. ;^)
}
}
else
{
returnValue = str;
}
return returnValue;
}
I initially wrote this as a string extension. Just add back the this before string str to put it back into extension format, of course. I removed the this so that we could just slap the method into Program.cs in a simple console app to demonstrate.
Test and expected output
Here's a good test case, with the output it create below, written expecting to be the Main method in a simple console app's Program.cs.
static void Main(string[] args)
{
string testValue = "12345“”67890”";
for (int i = 0; i < 15; i++)
{
string cutValue = Program.CutToUTF8Length(testValue, i);
Console.WriteLine(i.ToString().PadLeft(2) +
": " + Encoding.UTF8.GetByteCount(cutValue).ToString().PadLeft(2) +
":: " + cutValue);
}
Console.WriteLine();
Console.WriteLine();
foreach (byte b in Encoding.UTF8.GetBytes(testValue))
{
Console.WriteLine(b.ToString().PadLeft(3) + " " + (char)b);
}
Console.WriteLine("Return to end.");
Console.ReadLine();
}
Output follows. Notice that the "smart quotes" in testValue are three bytes long in UTF8 (though when we write the chars to the console in ASCII, it outputs dumb quotes). Also note the ?s output for the second and third bytes of each smart quote in the output.
The first five characters of our testValue are single bytes in UTF8, so 0-5 byte values should be 0-5 characters. Then we have a three-byte smart quote, which can't be included in its entirety until 5 + 3 bytes. Sure enough, we see that pop out at the call for 8.Our next smart quote pops out at 8 + 3 = 11, and then we're back to single byte characters through 14.
0: 0::
1: 1:: 1
2: 2:: 12
3: 3:: 123
4: 4:: 1234
5: 5:: 12345
6: 5:: 12345
7: 5:: 12345
8: 8:: 12345"
9: 8:: 12345"
10: 8:: 12345"
11: 11:: 12345""
12: 12:: 12345""6
13: 13:: 12345""67
14: 14:: 12345""678
49 1
50 2
51 3
52 4
53 5
226 â
128 ?
156 ?
226 â
128 ?
157 ?
54 6
55 7
56 8
57 9
48 0
226 â
128 ?
157 ?
Return to end.
So that's kind of fun, and I'm in just before the question's five year anniversary. Though Oren's description of the bits had a small error, that's exactly the trick you want to use. Thanks for the question; neat.
Here are two possible solution - a LINQ one-liner processing the input left to right and a traditional for-loop processing the input from right to left. Which processing direction is faster depends on the string length, the allowed byte length, and the number and distribution of multibyte characters and is hard to give a general suggestion. The decision between LINQ and traditional code I probably a matter of taste (or maybe speed).
If speed matters, one could think about just accumulating the byte length of each character until reaching the maximum length instead of calculating the byte length of the whole string in each iteration. But I am not sure if this will work because I don't know UTF-8 encoding well enough. I could theoreticaly imagine that the byte length of a string does not equal the sum of the byte lengths of all characters.
public static String LimitByteLength(String input, Int32 maxLength)
{
return new String(input
.TakeWhile((c, i) =>
Encoding.UTF8.GetByteCount(input.Substring(0, i + 1)) <= maxLength)
.ToArray());
}
public static String LimitByteLength2(String input, Int32 maxLength)
{
for (Int32 i = input.Length - 1; i >= 0; i--)
{
if (Encoding.UTF8.GetByteCount(input.Substring(0, i + 1)) <= maxLength)
{
return input.Substring(0, i + 1);
}
}
return String.Empty;
}
Shorter version of ruffin's answer. Takes advantage of the design of UTF8:
public static string LimitUtf8ByteCount(this string s, int n)
{
// quick test (we probably won't be trimming most of the time)
if (Encoding.UTF8.GetByteCount(s) <= n)
return s;
// get the bytes
var a = Encoding.UTF8.GetBytes(s);
// if we are in the middle of a character (highest two bits are 10)
if (n > 0 && ( a[n]&0xC0 ) == 0x80)
{
// remove all bytes whose two highest bits are 10
// and one more (start of multi-byte sequence - highest bits should be 11)
while (--n > 0 && ( a[n]&0xC0 ) == 0x80)
;
}
// convert back to string (with the limit adjusted)
return Encoding.UTF8.GetString(a, 0, n);
}
All of the other answers appear to miss the fact that this functionality is already built into .NET, in the Encoder class. For bonus points, this approach will also work for other encodings.
public static string LimitByteLength(string message, int maxLength)
{
if (string.IsNullOrEmpty(message) || Encoding.UTF8.GetByteCount(message) <= maxLength)
{
return message;
}
var encoder = Encoding.UTF8.GetEncoder();
byte[] buffer = new byte[maxLength];
char[] messageChars = message.ToCharArray();
encoder.Convert(
chars: messageChars,
charIndex: 0,
charCount: messageChars.Length,
bytes: buffer,
byteIndex: 0,
byteCount: buffer.Length,
flush: false,
charsUsed: out int charsUsed,
bytesUsed: out int bytesUsed,
completed: out bool completed);
// I don't think we can return message.Substring(0, charsUsed)
// as that's the number of UTF-16 chars, not the number of codepoints
// (think about surrogate pairs). Therefore I think we need to
// actually convert bytes back into a new string
return Encoding.UTF8.GetString(buffer, 0, bytesUsed);
}
If you're using .NET Standard 2.1+, you can simplify it a bit:
public static string LimitByteLength(string message, int maxLength)
{
if (string.IsNullOrEmpty(message) || Encoding.UTF8.GetByteCount(message) <= maxLength)
{
return message;
}
var encoder = Encoding.UTF8.GetEncoder();
byte[] buffer = new byte[maxLength];
encoder.Convert(message.AsSpan(), buffer.AsSpan(), false, out _, out int bytesUsed, out _);
return Encoding.UTF8.GetString(buffer, 0, bytesUsed);
}
None of the other answers account for extended grapheme clusters, such as 👩🏽‍🚒. This is composed of 4 Unicode scalars (👩, 🏽, a zero-width joiner, and 🚒), so you need knowledge of the Unicode standard to avoid splitting it in the middle and producing 👩 or 👩🏽.
In .NET 5 onwards, you can write this as:
public static string LimitByteLength(string message, int maxLength)
{
if (string.IsNullOrEmpty(message) || Encoding.UTF8.GetByteCount(message) <= maxLength)
{
return message;
}
var enumerator = StringInfo.GetTextElementEnumerator(message);
var result = new StringBuilder();
int lengthBytes = 0;
while (enumerator.MoveNext())
{
lengthBytes += Encoding.UTF8.GetByteCount(enumerator.GetTextElement());
if (lengthBytes <= maxLength)
{
result.Append(enumerator.GetTextElement());
}
}
return result.ToString();
}
(This same code runs on earlier versions of .NET, but due to a bug it won't produce the correct result before .NET 5).
If a UTF-8 byte has a zero-valued high order bit, it's the beginning of a character. If its high order bit is 1, it's in the 'middle' of a character. The ability to detect the beginning of a character was an explicit design goal of UTF-8.
Check out the Description section of the wikipedia article for more detail.
Is there a reason that you need the database column to be declared in terms of bytes? That's the default, but it's not a particularly useful default if the database character set is variable width. I'd strongly prefer declaring the column in terms of characters.
CREATE TABLE length_example (
col1 VARCHAR2( 10 BYTE ),
col2 VARCHAR2( 10 CHAR )
);
This will create a table where COL1 will store 10 bytes of data and col2 will store 10 characters worth of data. Character length semantics make far more sense in a UTF8 database.
Assuming you want all the tables you create to use character length semantics by default, you can set the initialization parameter NLS_LENGTH_SEMANTICS to CHAR. At that point, any tables you create will default to using character length semantics rather than byte length semantics if you don't specify CHAR or BYTE in the field length.
Following Oren Trutner's comment here are two more solutions to the problem:
here we count the number of bytes to remove from the end of the string according to each character at the end of the string, so we don't evaluate the entire string in every iteration.
string str = "朣楢琴执执 瑩浻牡楧硰执执獧浻牡楧敬瑦 瀰 絸朣杢执獧扻捡杫潲湵 潣"
int maxBytesLength = 30;
var bytesArr = Encoding.UTF8.GetBytes(str);
int bytesToRemove = 0;
int lastIndexInString = str.Length -1;
while(bytesArr.Length - bytesToRemove > maxBytesLength)
{
bytesToRemove += Encoding.UTF8.GetByteCount(new char[] {str[lastIndexInString]} );
--lastIndexInString;
}
string trimmedString = Encoding.UTF8.GetString(bytesArr,0,bytesArr.Length - bytesToRemove);
//Encoding.UTF8.GetByteCount(trimmedString);//get the actual length, will be <= 朣楢琴执执 瑩浻牡楧硰执执獧浻牡楧敬瑦 瀰 絸朣杢执獧扻捡杫潲湵 潣潬昣昸昸慢正
And an even more efficient(and maintainable) solution:
get the string from the bytes array according to desired length and cut the last character because it might be corrupted
string str = "朣楢琴执执 瑩浻牡楧硰执执獧浻牡楧敬瑦 瀰 絸朣杢执獧扻捡杫潲湵 潣"
int maxBytesLength = 30;
string trimmedWithDirtyLastChar = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(str),0,maxBytesLength);
string trimmedString = trimmedWithDirtyLastChar.Substring(0,trimmedWithDirtyLastChar.Length - 1);
The only downside with the second solution is that we might cut a perfectly fine last character, but we are already cutting the string, so it might fit with the requirements.
Thanks to Shhade who thought about the second solution
This is another solution based on binary search:
public string LimitToUTF8ByteLength(string text, int size)
{
if (size <= 0)
{
return string.Empty;
}
int maxLength = text.Length;
int minLength = 0;
int length = maxLength;
while (maxLength >= minLength)
{
length = (maxLength + minLength) / 2;
int byteLength = Encoding.UTF8.GetByteCount(text.Substring(0, length));
if (byteLength > size)
{
maxLength = length - 1;
}
else if (byteLength < size)
{
minLength = length + 1;
}
else
{
return text.Substring(0, length);
}
}
// Round down the result
string result = text.Substring(0, length);
if (size >= Encoding.UTF8.GetByteCount(result))
{
return result;
}
else
{
return text.Substring(0, length - 1);
}
}
public static string LimitByteLength3(string input, Int32 maxLenth)
{
string result = input;
int byteCount = Encoding.UTF8.GetByteCount(input);
if (byteCount > maxLenth)
{
var byteArray = Encoding.UTF8.GetBytes(input);
result = Encoding.UTF8.GetString(byteArray, 0, maxLenth);
}
return result;
}

Categories