How to convert int to char[] without generating garbage in C# - c#

Doubtless this seems like a strange request, given the availability of ToString() and Convert.ToString(), but I need to convert an unsigned integer (i.e. UInt32) to its string representation, but I need to store the answer into a char[].
The reason is that I am working with character arrays for efficiency, and since the target char[] is initialised on object creation as a member of length 10 (enough to hold the string representation of UInt32.MaxValue), it should be theoretically possible to do the conversion without generating any garbage (by which I mean without allocating any temporary objects on the managed heap).
Can anyone see a neat way to achieve this?
(I'm working in Framework 3.5 SP1, in case that is in any way relevant.)

Further to my comment above, I wondered if log10 was too slow, so I wrote a version that doesn't use it.
For four digit numbers this version is about 35% quicker, falling to about 16% quicker for ten digit numbers.
One disadvantage is that it requires space for the full ten digits in the buffer.
I don't swear it doesn't have any bugs!
public static int ToCharArray2(uint value, char[] buffer, int bufferIndex)
{
    const int maxLength = 10;
    if (value == 0)
    {
        buffer[bufferIndex] = '0';
        return 1;
    }
    int startIndex = bufferIndex + maxLength - 1;
    int index = startIndex;
    do
    {
        buffer[index] = (char)('0' + value % 10);
        value /= 10;
        --index;
    }
    while (value != 0);
    int length = startIndex - index;
    if (bufferIndex != index + 1)
    {
        while (index != startIndex)
        {
            ++index;
            buffer[bufferIndex] = buffer[index];
            ++bufferIndex;
        }
    }
    return length;
}
Update
I should add, I'm using a Pentium 4. More recent processors may calculate transcendental functions faster.
Conclusion
I realised yesterday that I'd made a schoolboy error and run the benchmarks on a debug build. So I ran them again but it didn't actually make much difference. The first column shows the number of digits in the number being converted. The remaining columns show the times in milliseconds to convert 500,000 numbers.
Results for uint:
digits luc1 arx henk1 luc3 henk2 luc2
1 715 217 966 242 837 244
2 877 420 1056 541 996 447
3 1059 608 1169 835 1040 610
4 1184 795 1282 1116 1162 801
5 1403 969 1405 1396 1279 978
6 1572 1149 1519 1674 1399 1170
7 1740 1335 1648 1952 1518 1352
8 1922 1675 1868 2233 1750 1545
9 2087 1791 2005 2511 1893 1720
10 2263 2103 2139 2797 2012 1985
Results for ulong:
digits luc1 arx henk1 luc3 henk2 luc2
1 802 280 998 390 856 317
2 912 516 1102 729 954 574
3 1066 746 1243 1060 1056 818
4 1300 1141 1362 1425 1170 1210
5 1557 1363 1503 1742 1306 1436
6 1801 1603 1612 2233 1413 1672
7 2269 1814 1723 2526 1530 1861
8 2208 2142 1920 2886 1634 2149
9 2360 2376 2063 3211 1775 2339
10 2615 2622 2213 3639 2011 2697
11 3048 2996 2513 4199 2244 3011
12 3413 3607 2507 4853 2326 3666
13 3848 3988 2663 5618 2478 4005
14 4298 4525 2748 6302 2558 4637
15 4813 5008 2974 7005 2712 5065
16 5161 5654 3350 7986 2994 5864
17 5997 6155 3241 8329 2999 5968
18 6490 6280 3296 8847 3127 6372
19 6440 6720 3557 9514 3386 6788
20 7045 6616 3790 10135 3703 7268
luc1: Lucero's first function
arx: my function
henk1: Henk's function
luc3: Lucero's third function
henk2: Henk's function without the copy to the char array; i.e. just test the performance of ToString().
luc2: Lucero's second function
The peculiar order is the order they were created in.
I also ran the test without henk1 and henk2 so there would be no garbage collection. The times for the other three functions were nearly identical. Once the benchmark had gone past three digits the memory use was stable: so GC was happening during Henk's functions and didn't have a detrimental effect on the other functions.
Conclusion: just call ToString()
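Henk's helper isn't reproduced in this post, but a ToString-based version along these lines (a sketch only, not the code that was benchmarked as henk1/henk2) shows what "just call ToString()" looks like when the result still has to land in a char[]. It does allocate the temporary string, which the benchmark suggests is not worth worrying about:
// Sketch only - not the benchmarked code.
public static int ToCharArrayViaToString(uint value, char[] buffer, int bufferIndex)
{
    string s = value.ToString(System.Globalization.CultureInfo.InvariantCulture);
    s.CopyTo(0, buffer, bufferIndex, s.Length); // copy the characters into the existing buffer
    return s.Length;
}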

The following code does it, with the following caveat: it does not respect the culture settings, but always outputs normal decimal digits.
public static int ToCharArray(uint value, char[] buffer, int bufferIndex) {
    if (value == 0) {
        buffer[bufferIndex] = '0';
        return 1;
    }
    // digit count; note that Ceiling(Log10(value)) alone under-counts exact powers of ten
    int len = (int)Math.Log10(value) + 1;
    for (int i = len - 1; i >= 0; i--) {
        buffer[bufferIndex + i] = (char)('0' + (value % 10));
        value /= 10;
    }
    return len;
}
The returned value is how much of the char[] has been used.
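A hypothetical usage example (buffer size and value are arbitrary), showing how the returned length picks out the written portion:
char[] buffer = new char[10];
int used = ToCharArray(123456u, buffer, 0);
Console.WriteLine(new string(buffer, 0, used)); // prints "123456"
Of course, wrapping the result in a new string reintroduces an allocation; in the garbage-free scenario from the question you would pass the buffer and length along instead.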
Edit (for arx): the following version avoids the floating-point math and swaps the buffer in-place:
public static int ToCharArray(uint value, char[] buffer, int bufferIndex) {
    if (value == 0) {
        buffer[bufferIndex] = '0';
        return 1;
    }
    int bufferEndIndex = bufferIndex;
    while (value > 0) {
        buffer[bufferEndIndex++] = (char)('0' + (value % 10));
        value /= 10;
    }
    int len = bufferEndIndex - bufferIndex;
    while (--bufferEndIndex > bufferIndex) {
        char ch = buffer[bufferEndIndex];
        buffer[bufferEndIndex] = buffer[bufferIndex];
        buffer[bufferIndex++] = ch;
    }
    return len;
}
And here is yet another variation, which computes the number of digits in a small loop:
public static int ToCharArray(uint value, char[] buffer, int bufferIndex) {
    if (value == 0) {
        buffer[bufferIndex] = '0';
        return 1;
    }
    int len = 1;
    for (uint rem = value / 10; rem > 0; rem /= 10) {
        len++;
    }
    for (int i = len - 1; i >= 0; i--) {
        buffer[bufferIndex + i] = (char)('0' + (value % 10));
        value /= 10;
    }
    return len;
}
I leave the benchmarking to whoever wants to do it... ;)
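For anyone who does want to benchmark, a rough Stopwatch harness along these lines would do (the iteration count and test value here are arbitrary choices, not the ones used for the tables above, and ToCharArray stands for whichever variant you want to time):
var buffer = new char[10];
var sw = System.Diagnostics.Stopwatch.StartNew();
for (int i = 0; i < 500000; i++)
{
    ToCharArray(1234567890u, buffer, 0);
}
sw.Stop();
Console.WriteLine("500,000 conversions took {0} ms", sw.ElapsedMilliseconds);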

I'm coming a little late to the party, but I guess you probably cannot get faster and less memory-demanding results than by simply reinterpreting the memory:
[System.Security.SecuritySafeCritical]
public static unsafe char[] GetChars(int value, char[] chars)
{
    // TODO: if this needs to work across machines then it should also use
    // BitConverter.IsLittleEndian to detect little/big endian
    // and order the bytes appropriately
    fixed (char* numPtr = chars)
        *(int*)numPtr = value;
    return chars;
}

[System.Security.SecuritySafeCritical]
public static unsafe int ToInt32(char[] value)
{
    // TODO: if this needs to work across machines then it should also use
    // BitConverter.IsLittleEndian to detect little/big endian
    // and order the bytes appropriately
    fixed (char* numPtr = value)
        return *(int*)numPtr;
}
This is just a demonstration of the idea - you'd obviously need to add a check for the char array size and make sure the byte ordering is handled properly. You can peek into the reflected helper methods of BitConverter for those checks.
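A rough idea of what such guards might look like (a sketch under my own assumptions, not part of the original answer): a size check plus a byte swap so the stored representation is always little-endian regardless of the machine.
[System.Security.SecuritySafeCritical]
public static unsafe char[] GetCharsChecked(int value, char[] chars)
{
    if (chars == null)
        throw new ArgumentNullException("chars");
    if (chars.Length < sizeof(int) / sizeof(char)) // the 4 bytes of an int need 2 chars
        throw new ArgumentException("Destination array is too small.", "chars");
    if (!BitConverter.IsLittleEndian)
    {
        // reverse the byte order so the stored representation is machine-independent
        uint v = (uint)value;
        value = (int)((v << 24) | ((v & 0x0000FF00u) << 8) | ((v & 0x00FF0000u) >> 8) | (v >> 24));
    }
    fixed (char* numPtr = chars)
        *(int*)numPtr = value;
    return chars;
}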

Related

Binary of a number

Is there a simple way to convert decimal/ASCII 6-bit decimal numbers from 1 to 100 to a binary representation?
To be more specific, I'm interested in 6-bit binary ASCII. So I made this to get an Int32.
For example, "u" is changed to 61 instead of 117 in standard decimal ASCII.
Then this 61 needs to become "111101" instead of the traditional "01110101", but after the 48 + 8 math that doesn't matter, as it's now ordinary binary, just with only 6 bits used.
foreach (char c in partToDecode)
{
    var sum = c - 48;
    if (sum > 40)
    {
        sum = sum - 8;
    }
}
I found this, but I don't have a clue how to transpose it to C#:
void binary(unsigned n) {
    unsigned i;
    // loop from the most significant bit down to the least
    for (i = 1u << 31; i > 0; i >>= 1)
        printf("%u", !!(n & i));
}
. . .
binary(65);
You can try Convert.ToString, e.g.
int source = 61;
// "111101"
string result = Convert.ToString(source, 2).PadLeft(6, '0');
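Putting that together with the question's 48/8 offset math, the whole loop might look roughly like this (a sketch; partToDecode is assumed to be the input string):
foreach (char c in partToDecode)
{
    int sum = c - 48;
    if (sum > 40)
    {
        sum = sum - 8;
    }
    // e.g. 'u' -> 117 - 48 = 69 -> 69 - 8 = 61 -> "111101"
    string sixBits = Convert.ToString(sum, 2).PadLeft(6, '0');
    Console.WriteLine(sixBits);
}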

Create random ints with minimum and maximum from Random.NextBytes()

Title pretty much says it all. I know I could use Random.Next(), of course, but I want to know if there's a way to turn unbounded random data into bounded data without statistical bias. (That means no (RandomInt() % (maximum - minimum)) + minimum.) Surely there is a method like that, one that doesn't introduce bias into the data it outputs?
If you assume that the bits are randomly distributed, I would suggest:
Generate enough bytes to get a number within the range (e.g. 1 byte to get a number in the range 0-100, 2 bytes to get a number in the range 0-30000 etc).
Use only enough bits from those bytes to cover the range you need. So for example, if you're generating numbers in the range 0-100, take the bottom 7 bits of the byte you've generated
Interpret the bits you've got as a number in the range [0, 2^n), where n is the number of bits
Check whether the number is in your desired range. It should be at least half the time (on average)
If so, use it. If not, repeat the above steps until a number is in the right range.
The use of just the required number of bits is key to making this efficient - you'll throw away up to half the number of bytes you generate, but no more than that, assuming a good distribution. (And if you are generating numbers in a nicely binary range, you won't need to throw anything away.)
Implementation left as an exercise to the reader :)
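For the record, a sketch of that approach (names and structure are mine, not from the answer, and it assumes maxValue > minValue): draw just enough bytes, mask down to the minimum number of bits, and retry values that fall outside the range.
public static int NextUnbiased(Random rng, int minValue, int maxValue)
{
    // returns a value in [minValue, maxValue), rejecting out-of-range candidates
    long range = (long)maxValue - minValue;
    int bits = 0;
    while ((1L << bits) < range)
        bits++;                               // smallest power of two covering the range
    int bytesNeeded = (bits + 7) / 8;
    var buffer = new byte[bytesNeeded];
    while (true)
    {
        rng.NextBytes(buffer);
        long candidate = 0;
        for (int i = 0; i < bytesNeeded; i++)
            candidate = (candidate << 8) | buffer[i];
        candidate &= (1L << bits) - 1;        // keep only the bits we need
        if (candidate < range)
            return (int)(minValue + candidate);
    }
}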
You could try with something like:
public static int MyNextInt(Random rnd, int minValue, int maxValue)
{
    var buffer = new byte[4];
    rnd.NextBytes(buffer);
    uint num = BitConverter.ToUInt32(buffer, 0);
    // The +1 is to exclude the maxValue in the case that
    // minValue == int.MinValue, maxValue == int.MaxValue
    double dbl = num * 1.0 / ((long)uint.MaxValue + 1);
    long range = (long)maxValue - minValue;
    int result = (int)(dbl * range) + minValue;
    return result;
}
Totally untested... I can't guarantee that the results are truly pseudo-random... But the idea of creating a double (dbl) is the same one used by the Random class; the only difference is that I use uint.MaxValue as the base instead of int.MaxValue, so I don't have to check for negative values from the buffer.
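Hypothetical usage, just to show the intended range (maxValue is exclusive here because of the +1 above):
var rnd = new Random();
for (int i = 0; i < 5; i++)
{
    Console.WriteLine(MyNextInt(rnd, -10, 10)); // values in [-10, 10)
}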
I propose a generator of random integers, based on NextBytes.
This method discards only 9.62% of bits on average over the word-size range for positive Int32s, due to the use of Int64 as the representation for bit manipulation.
Maximum bit loss occurs at word size of 22 bits, and it's 20 lost bits of 64 used in byte range conversion. In this case bit efficiency is 68.75%
Also, 25% of values are lost because of clipping the unbound range to maximum value.
Be careful to use Take(N) on the IEnumerable returned, because it's an infinite generator otherwise.
I'm using a buffer of 512 long values, so it generates 4096 random bytes at once. If you just need a sequence of few integers, change the buffer size from 512 to a more optimal value, down to 1.
public static class RandomExtensions
{
    public static IEnumerable<int> GetRandomIntegers(this Random r, int max)
    {
        if (max < 1)
            throw new ArgumentOutOfRangeException("max", max, "Must be a positive value.");

        const int longWordsTotal = 512;
        const int bufferSize = longWordsTotal * 8;
        var buffer = new byte[bufferSize];
        var wordSize = (int)Math.Log(max, 2) + 1;
        while (true)
        {
            r.NextBytes(buffer);
            for (var longWordIndex = 0; longWordIndex < longWordsTotal; longWordIndex++)
            {
                // each 64-bit word starts at a byte offset of longWordIndex * 8
                ulong longWord = BitConverter.ToUInt64(buffer, longWordIndex * 8);
                var lastStartBit = 64 - wordSize;
                for (var startBit = 0; startBit <= lastStartBit; startBit += wordSize)
                {
                    var mask = ((1UL << wordSize) - 1) << startBit;
                    var unboundValue = (int)((mask & longWord) >> startBit);
                    if (unboundValue <= max)
                        yield return unboundValue;
                }
            }
        }
    }
}
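Hypothetical usage (requires using System.Linq) - note the Take(), since the sequence never ends on its own:
var rnd = new Random();
var tenValues = rnd.GetRandomIntegers(100).Take(10).ToList();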

How do I properly loop through and print bits of an Int, Long, Float, or BigInteger?

I'm trying to debug some bit shifting operations and I need to visualize the bits as they exist before and after a Bit-Shifting operation.
I read from this answer that I may need to handle backfill from the shifting, but I'm not sure what that means.
I think that by asking this question (how do I print the bits in a int) I can figure out what the backfill is, and perhaps some other questions I have.
Here is my sample code so far.
static string GetBits(int num)
{
    StringBuilder sb = new StringBuilder();
    uint bits = (uint)num;
    while (bits != 0)
    {
        bits >>= 1;
        isBitSet = // somehow do an | operation on the first bit.
                   // I'm unsure if it's possible to handle different data types here
                   // or if unsafe code and a PTR is needed
        if (isBitSet)
            sb.Append("1");
        else
            sb.Append("0");
    }
}
Convert.ToString(56,2).PadLeft(8,'0') returns "00111000"
This is for a byte; it works for an int as well - just increase the numbers.
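For a full 32-bit int the same one-liner just needs a wider pad, e.g.:
int num = 56;
string bits = Convert.ToString(num, 2).PadLeft(32, '0');
Console.WriteLine(bits); // 00000000000000000000000000111000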
To test if the last bit is set you could use:
isBitSet = ((bits & 1) == 1);
But you should do so before shifting right (not after), otherwise you'd miss the first bit:
isBitSet = ((bits & 1) == 1);
bits = bits >> 1;
But a better option would be to use the static methods of the BitConverter class to get the actual bytes used to represent the number in memory into a byte array. The advantage (or disadvantage depending on your needs) of this method is that this reflects the endianness of the machine running the code.
byte[] bytes = BitConverter.GetBytes(num);
int bitPos = 0;
while (bitPos < 8 * bytes.Length)
{
    int byteIndex = bitPos / 8;
    int offset = bitPos % 8;
    bool isSet = (bytes[byteIndex] & (1 << offset)) != 0;
    // isSet == true if the bit at bitPos is set, false otherwise
    bitPos++;
}
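Using that, one way the asker's GetBits method could be completed (a sketch; it walks the bits from least significant to most significant within each byte, starting with the least significant byte on a little-endian machine):
static string GetBits(int num)
{
    var sb = new StringBuilder();
    byte[] bytes = BitConverter.GetBytes(num);
    for (int bitPos = 0; bitPos < 8 * bytes.Length; bitPos++)
    {
        bool isSet = (bytes[bitPos / 8] & (1 << (bitPos % 8))) != 0;
        sb.Append(isSet ? '1' : '0');
    }
    return sb.ToString();
}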

how to loop through the digits of a binary number?

I have a binary number 1011011, how can I loop through all these binary digits one after the other ?
I know how to do this for decimal integers by using modulo and division.
int n = 0x5b; // 1011011
Really you should just do this, hexadecimal in general is much better representation:
printf("%x", n); // this prints "5b"
To get it in binary (with emphasis on easy understanding), try something like this:
printf("%s", "0b");   // common prefix to denote that binary follows
bool leading = true;  // we're still processing leading zeroes
// from the most significant bit down to the least
for (int i = sizeof(n) * CHAR_BIT - 1; i >= 0; --i) {
    int bit = (n >> i) & 1;
    leading &= !bit;  // once we see a 1, we are no longer reading leading zeroes
    if (!leading)
        printf("%d", bit);
}
if (leading) // all zero, so just print 0
    printf("0");
// at this point, for n = 0x5b, we'll have printed 0b1011011
You can use modulo and division by 2 exactly like you would in base 10. You can also use binary operators, but if you already know how to do it in base 10, it would be easier to just use division and modulo.
Expanding on Frédéric and Gabi's answers, all you need to do is realise that the rules in base 2 are no different from those in base 10 - you just need to do your division and modulus with a divisor of 2 instead of 10.
The next step is simply to use number >> 1 instead of number / 2 and number & 0x1 instead of number % 2 to improve performance. Mind you, with modern optimising compilers there's probably no difference...
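As a small illustration of that equivalence (a sketch, assuming a non-negative n), both forms peel off the least significant binary digit first:
int n = 0x5b; // 1011011
while (n != 0)
{
    int viaModulo = n % 2;   // same digit as n & 0x1
    int viaMask = n & 0x1;
    Console.Write(viaMask);  // prints 1101101 - the digits in reverse order
    n >>= 1;                 // same as n / 2 for non-negative n
}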
Use an AND with increasing powers of two...
In C, at least, you can do something like:
while (val != 0)
{
    printf("%d", val & 0x1);
    val = val >> 1;
}
To expand on #Marco's answer with an example:
uint value = 0x82fa9281;
for (int i = 0; i < 32; i++)
{
    bool set = (value & 0x1) != 0;
    value >>= 1;
    Console.WriteLine("Bit set: {0}", set);
}
What this does is test the last bit, and then shift everything one bit.
If you're already starting with a string, you could just iterate through each of the characters in the string:
var values = "1011011".Reverse().ToCharArray();
for(var index = 0; index < values.Length; index++) {
var isSet = (Boolean)Int32.Parse(values[index]); // Boolean.Parse only works on "true"/"false", not 0/1
// do whatever
}
byte input = Convert.ToByte("1011011", 2);
BitArray arr = new BitArray(new[] { input });
foreach (bool value in arr)
{
    // ...
}
You can simply loop through every bit. The following C-like pseudocode lets you pick the bit number you want to check. (You might also want to google endianness.)
for ()
{
    bitnumber = <your bit>
    printf("%d", (val & 1 << bitnumber) ? 1 : 0);
}
The code basically writes 1 if the bit is set or 0 if not. We shift the value 1 (which in binary is just 1 ;) ) left by bitnumber bits and then AND it with val to see whether that bit is set. Simple as that!
So if bitnumber is 2 we simply do this:
00000100 (the value 1 shifted left by 2)
AND
10110110 (whatever your value is)
=
00000100 = true! - both values have bit 2 set!

Best way to shorten UTF8 string based on byte length

A recent project called for importing data into an Oracle database. The program that will do this is a C# .Net 3.5 app and I'm using the Oracle.DataAccess connection library to handle the actual inserting.
I ran into a problem where I'd receive this error message when inserting a particular field:
ORA-12899 Value too large for column X
I used Field.Substring(0, MaxLength); but still got the error (though not for every record).
Finally I saw what should have been obvious: my string was in ANSI and the field was UTF8, and its length is defined in bytes, not characters.
This gets me to my question. What is the best way to trim my string to fit the MaxLength?
My substring code works by character length. Is there a simple C# function that can trim a UTF8 string intelligently by byte length (i.e. not hack off half a character)?
I think we can do better than naively counting the total length of a string with each addition. LINQ is cool, but it can accidentally encourage inefficient code. What if I wanted the first 80,000 bytes of a giant UTF string? That's a lot of unnecessary counting. "I've got 1 byte. Now I've got 2. Now I've got 13... Now I have 52,384..."
That's silly. Most of the time, at least in l'anglais, we can cut exactly on that nth byte. Even in another language, we're less than 6 bytes away from a good cutting point.
So I'm going to start from #Oren's suggestion, which is to key off of the leading bit of a UTF8 char value. Let's start by cutting right at the n+1th byte, and use Oren's trick to figure out if we need to cut a few bytes earlier.
Three possibilities
If the first byte after the cut has a 0 in the leading bit, I know I'm cutting precisely before a single byte (conventional ASCII) character, and can cut cleanly.
If I have a 11 following the cut, the next byte after the cut is the start of a multi-byte character, so that's a good place to cut too!
If I have a 10, however, I know I'm in the middle of a multi-byte character, and need to go back to check to see where it really starts.
That is, though I want to cut the string after the nth byte, if that n+1th byte comes in the middle of a multi-byte character, cutting would create an invalid UTF8 value. I need to back up until I get to one that starts with 11 and cut just before it.
Code
Notes: I'm using stuff like Convert.ToByte("11000000", 2) so that it's easy to tell what bits I'm masking (a little more about bit masking here). In a nutshell, I'm &ing to return what's in the byte's first two bits and bringing back 0s for the rest. Then I check the XX from XX000000 to see if it's 10 or 11, where appropriate.
I found out today that C# 6.0 might actually support binary representations, which is cool, but we'll keep using this kludge for now to illustrate what's going on.
The PadLeft is just because I'm overly OCD about output to the Console.
So here's a function that'll cut you down to a string that's n bytes long, or the greatest length less than n that ends with a "complete" UTF8 character.
public static string CutToUTF8Length(string str, int byteLength)
{
    byte[] byteArray = Encoding.UTF8.GetBytes(str);
    string returnValue = string.Empty;

    if (byteArray.Length > byteLength)
    {
        int bytePointer = byteLength;

        // Check high bit to see if we're [potentially] in the middle of a multi-byte char
        if (bytePointer >= 0
            && (byteArray[bytePointer] & Convert.ToByte("10000000", 2)) > 0)
        {
            // If so, keep walking back until we have a byte starting with `11`,
            // which means the first byte of a multi-byte UTF8 character.
            while (bytePointer >= 0
                && Convert.ToByte("11000000", 2) != (byteArray[bytePointer] & Convert.ToByte("11000000", 2)))
            {
                bytePointer--;
            }
        }

        // See if we had 1s in the high bit all the way back. If so, we're toast. Return empty string.
        if (0 != bytePointer)
        {
            returnValue = Encoding.UTF8.GetString(byteArray, 0, bytePointer); // hat tip to #NealEhardt! Well played. ;^)
        }
    }
    else
    {
        returnValue = str;
    }

    return returnValue;
}
I initially wrote this as a string extension. Just add back the this before string str to put it back into extension format, of course. I removed the this so that we could just slap the method into Program.cs in a simple console app to demonstrate.
Test and expected output
Here's a good test case, with the output it creates below, written to be the Main method of a simple console app's Program.cs.
static void Main(string[] args)
{
    string testValue = "12345“”67890”";

    for (int i = 0; i < 15; i++)
    {
        string cutValue = Program.CutToUTF8Length(testValue, i);

        Console.WriteLine(i.ToString().PadLeft(2) +
            ": " + Encoding.UTF8.GetByteCount(cutValue).ToString().PadLeft(2) +
            ":: " + cutValue);
    }

    Console.WriteLine();
    Console.WriteLine();

    foreach (byte b in Encoding.UTF8.GetBytes(testValue))
    {
        Console.WriteLine(b.ToString().PadLeft(3) + " " + (char)b);
    }

    Console.WriteLine("Return to end.");
    Console.ReadLine();
}
Output follows. Notice that the "smart quotes" in testValue are three bytes long in UTF8 (though when we write the chars to the console in ASCII, it outputs dumb quotes). Also note the ?s output for the second and third bytes of each smart quote in the output.
The first five characters of our testValue are single bytes in UTF8, so 0-5 byte values should be 0-5 characters. Then we have a three-byte smart quote, which can't be included in its entirety until 5 + 3 bytes. Sure enough, we see that pop out at the call for 8. Our next smart quote pops out at 8 + 3 = 11, and then we're back to single-byte characters through 14.
0: 0::
1: 1:: 1
2: 2:: 12
3: 3:: 123
4: 4:: 1234
5: 5:: 12345
6: 5:: 12345
7: 5:: 12345
8: 8:: 12345"
9: 8:: 12345"
10: 8:: 12345"
11: 11:: 12345""
12: 12:: 12345""6
13: 13:: 12345""67
14: 14:: 12345""678
49 1
50 2
51 3
52 4
53 5
226 â
128 ?
156 ?
226 â
128 ?
157 ?
54 6
55 7
56 8
57 9
48 0
226 â
128 ?
157 ?
Return to end.
So that's kind of fun, and I'm in just before the question's five year anniversary. Though Oren's description of the bits had a small error, that's exactly the trick you want to use. Thanks for the question; neat.
Here are two possible solutions - a LINQ one-liner processing the input left to right and a traditional for-loop processing the input from right to left. Which processing direction is faster depends on the string length, the allowed byte length, and the number and distribution of multibyte characters, so it is hard to give a general recommendation. The decision between LINQ and traditional code is probably a matter of taste (or maybe speed).
If speed matters, one could think about just accumulating the byte length of each character until reaching the maximum length, instead of calculating the byte length of the whole string in each iteration. But I am not sure this will work, because I don't know UTF-8 encoding well enough; I could theoretically imagine that the byte length of a string does not equal the sum of the byte lengths of all its characters. (A sketch of this idea follows the two solutions below.)
public static String LimitByteLength(String input, Int32 maxLength)
{
    return new String(input
        .TakeWhile((c, i) =>
            Encoding.UTF8.GetByteCount(input.Substring(0, i + 1)) <= maxLength)
        .ToArray());
}

public static String LimitByteLength2(String input, Int32 maxLength)
{
    for (Int32 i = input.Length - 1; i >= 0; i--)
    {
        if (Encoding.UTF8.GetByteCount(input.Substring(0, i + 1)) <= maxLength)
        {
            return input.Substring(0, i + 1);
        }
    }

    return String.Empty;
}
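And here is roughly what the accumulating idea mentioned above might look like (a sketch; the name and the surrogate handling are mine). For UTF-8 the byte length of the whole string does equal the sum of the byte lengths of its characters, as long as surrogate pairs are measured together rather than split:
public static String LimitByteLengthAccumulating(String input, Int32 maxLength)
{
    int byteCount = 0;
    for (Int32 i = 0; i < input.Length; i++)
    {
        // measure a surrogate pair as one unit so its byte count is correct
        int charSize = Char.IsHighSurrogate(input[i]) && i + 1 < input.Length
            ? Encoding.UTF8.GetByteCount(input.Substring(i, 2))
            : Encoding.UTF8.GetByteCount(input.Substring(i, 1));
        if (byteCount + charSize > maxLength)
        {
            return input.Substring(0, i);
        }
        byteCount += charSize;
        if (Char.IsHighSurrogate(input[i]) && i + 1 < input.Length)
        {
            i++; // the low surrogate has already been counted
        }
    }
    return input;
}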
Shorter version of ruffin's answer. Takes advantage of the design of UTF8:
public static string LimitUtf8ByteCount(this string s, int n)
{
    // quick test (we probably won't be trimming most of the time)
    if (Encoding.UTF8.GetByteCount(s) <= n)
        return s;
    // get the bytes
    var a = Encoding.UTF8.GetBytes(s);
    // if we are in the middle of a character (highest two bits are 10)
    if (n > 0 && (a[n] & 0xC0) == 0x80)
    {
        // remove all bytes whose two highest bits are 10
        // and one more (start of multi-byte sequence - highest bits should be 11)
        while (--n > 0 && (a[n] & 0xC0) == 0x80)
            ;
    }
    // convert back to string (with the limit adjusted)
    return Encoding.UTF8.GetString(a, 0, n);
}
All of the other answers appear to miss the fact that this functionality is already built into .NET, in the Encoder class. For bonus points, this approach will also work for other encodings.
public static string LimitByteLength(string message, int maxLength)
{
    if (string.IsNullOrEmpty(message) || Encoding.UTF8.GetByteCount(message) <= maxLength)
    {
        return message;
    }

    var encoder = Encoding.UTF8.GetEncoder();
    byte[] buffer = new byte[maxLength];
    char[] messageChars = message.ToCharArray();
    encoder.Convert(
        chars: messageChars,
        charIndex: 0,
        charCount: messageChars.Length,
        bytes: buffer,
        byteIndex: 0,
        byteCount: buffer.Length,
        flush: false,
        charsUsed: out int charsUsed,
        bytesUsed: out int bytesUsed,
        completed: out bool completed);

    // I don't think we can return message.Substring(0, charsUsed)
    // as that's the number of UTF-16 chars, not the number of codepoints
    // (think about surrogate pairs). Therefore I think we need to
    // actually convert bytes back into a new string
    return Encoding.UTF8.GetString(buffer, 0, bytesUsed);
}
If you're using .NET Standard 2.1+, you can simplify it a bit:
public static string LimitByteLength(string message, int maxLength)
{
    if (string.IsNullOrEmpty(message) || Encoding.UTF8.GetByteCount(message) <= maxLength)
    {
        return message;
    }

    var encoder = Encoding.UTF8.GetEncoder();
    byte[] buffer = new byte[maxLength];
    encoder.Convert(message.AsSpan(), buffer.AsSpan(), false, out _, out int bytesUsed, out _);
    return Encoding.UTF8.GetString(buffer, 0, bytesUsed);
}
None of the other answers account for extended grapheme clusters, such as 👩🏽‍🚒. This is composed of 4 Unicode scalars (👩, 🏽, a zero-width joiner, and 🚒), so you need knowledge of the Unicode standard to avoid splitting it in the middle and producing 👩 or 👩🏽.
In .NET 5 onwards, you can write this as:
public static string LimitByteLength(string message, int maxLength)
{
    if (string.IsNullOrEmpty(message) || Encoding.UTF8.GetByteCount(message) <= maxLength)
    {
        return message;
    }

    var enumerator = StringInfo.GetTextElementEnumerator(message);
    var result = new StringBuilder();
    int lengthBytes = 0;
    while (enumerator.MoveNext())
    {
        lengthBytes += Encoding.UTF8.GetByteCount(enumerator.GetTextElement());
        if (lengthBytes <= maxLength)
        {
            result.Append(enumerator.GetTextElement());
        }
    }

    return result.ToString();
}
(This same code runs on earlier versions of .NET, but due to a bug it won't produce the correct result before .NET 5).
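Hypothetical usage on .NET 5 or later (the 👩🏽‍🚒 cluster is 15 UTF-8 bytes, so with a 10-byte limit the whole cluster is dropped rather than split):
Console.WriteLine(LimitByteLength("abc👩🏽‍🚒def", 10)); // prints "abc" - the trailing "def" is also dropped, since the running total already exceeds the limit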
If a UTF-8 byte has a zero-valued high order bit, it's the beginning of a character. If its high order bit is 1, it's in the 'middle' of a character. The ability to detect the beginning of a character was an explicit design goal of UTF-8.
Check out the Description section of the wikipedia article for more detail.
Is there a reason that you need the database column to be declared in terms of bytes? That's the default, but it's not a particularly useful default if the database character set is variable width. I'd strongly prefer declaring the column in terms of characters.
CREATE TABLE length_example (
col1 VARCHAR2( 10 BYTE ),
col2 VARCHAR2( 10 CHAR )
);
This will create a table where COL1 will store 10 bytes of data and col2 will store 10 characters worth of data. Character length semantics make far more sense in a UTF8 database.
Assuming you want all the tables you create to use character length semantics by default, you can set the initialization parameter NLS_LENGTH_SEMANTICS to CHAR. At that point, any tables you create will default to using character length semantics rather than byte length semantics if you don't specify CHAR or BYTE in the field length.
Following Oren Trutner's comment, here are two more solutions to the problem.
In the first, we count the number of bytes to remove from the end of the string, character by character, so we don't evaluate the entire string in every iteration.
string str = "朣楢琴执执 瑩浻牡楧硰执执獧浻牡楧敬瑦 瀰 絸朣杢执獧扻捡杫潲湵 潣"
int maxBytesLength = 30;
var bytesArr = Encoding.UTF8.GetBytes(str);
int bytesToRemove = 0;
int lastIndexInString = str.Length -1;
while(bytesArr.Length - bytesToRemove > maxBytesLength)
{
bytesToRemove += Encoding.UTF8.GetByteCount(new char[] {str[lastIndexInString]} );
--lastIndexInString;
}
string trimmedString = Encoding.UTF8.GetString(bytesArr,0,bytesArr.Length - bytesToRemove);
//Encoding.UTF8.GetByteCount(trimmedString);//get the actual length, will be <= 朣楢琴执执 瑩浻牡楧硰执执獧浻牡楧敬瑦 瀰 絸朣杢执獧扻捡杫潲湵 潣潬昣昸昸慢正
And an even more efficient (and maintainable) solution:
Get the string from the byte array according to the desired length and cut the last character, because it might be corrupted:
string str = "朣楢琴执执 瑩浻牡楧硰执执獧浻牡楧敬瑦 瀰 絸朣杢执獧扻捡杫潲湵 潣"
int maxBytesLength = 30;
string trimmedWithDirtyLastChar = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(str),0,maxBytesLength);
string trimmedString = trimmedWithDirtyLastChar.Substring(0,trimmedWithDirtyLastChar.Length - 1);
The only downside with the second solution is that we might cut a perfectly fine last character, but we are already cutting the string, so it might fit with the requirements.
Thanks to Shhade, who thought of the second solution.
This is another solution based on binary search:
public string LimitToUTF8ByteLength(string text, int size)
{
    if (size <= 0)
    {
        return string.Empty;
    }

    int maxLength = text.Length;
    int minLength = 0;
    int length = maxLength;

    while (maxLength >= minLength)
    {
        length = (maxLength + minLength) / 2;
        int byteLength = Encoding.UTF8.GetByteCount(text.Substring(0, length));
        if (byteLength > size)
        {
            maxLength = length - 1;
        }
        else if (byteLength < size)
        {
            minLength = length + 1;
        }
        else
        {
            return text.Substring(0, length);
        }
    }

    // Round down the result
    string result = text.Substring(0, length);
    if (size >= Encoding.UTF8.GetByteCount(result))
    {
        return result;
    }
    else
    {
        return text.Substring(0, length - 1);
    }
}
public static string LimitByteLength3(string input, Int32 maxLength)
{
    string result = input;
    int byteCount = Encoding.UTF8.GetByteCount(input);
    if (byteCount > maxLength)
    {
        var byteArray = Encoding.UTF8.GetBytes(input);
        result = Encoding.UTF8.GetString(byteArray, 0, maxLength);
    }
    return result;
}
