ToString("X") produces single digit hex numbers - c#

We wrote a crude data scope.
(The freeware terminal programs we found were unable to keep up with Bluetooth speeds)
The results are okay, and we are writing them to a comma-separated file for use with a spreadsheet. It would be better to see the hex values line up in nice columns in the RichTextBox instead of the way it looks now (screen cap appended).
This is the routine that adds the hex values (numbers from 0 to FF) to the text in the RichTextBox.
public void Write(byte[] b)
{
    if (writting)
    {
        for (int i = 0; i < b.Length; i++)
        {
            storage[sPlace++] = b[i];
            pass += b[i].ToString("X") + " "; //// <<<--- Here is the problem
            if (sPlace % numericUpDown1.Value == 0)
            {
                pass += "\r\n";
            }
        }
    }
}
I would like a way for the statement pass += b[i].ToString("X") + " "; to produce a leading zero on values from 00h to 0Fh.
Or, some other way to turn the value in byte b into two hexadecimal characters from 00 to FF.
The digits on the left (FF 40 0 5) happen to line up neatly because they are identical. As soon as we encounter any difference in the data, the columns vanish and the data becomes extremely difficult for a human to read.

Use a standard numeric format string with a precision specifier:
pass += b[i].ToString("X2") + " ";
The MSDN documentation on Standard Numeric Format Strings has examples.
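To illustrate (my own quick sketch, using the sample values from the question):
byte[] sample = { 0xFF, 0x40, 0x00, 0x05 };
foreach (byte b in sample)
    Console.Write(b.ToString("X") + " ");   // prints: FF 40 0 5   (widths vary, so columns drift)
Console.WriteLine();
foreach (byte b in sample)
    Console.Write(b.ToString("X2") + " ");  // prints: FF 40 00 05 (always two digits, so columns align)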


How to divide long text into bytes c#

I'm really a newbie at coding, and my problem is how to divide a string like "7900BD7400BD7500BD76A5FF" into this: "79 00 BD 74 00 BD 75 00 BD 76 A5 FF". My main problem was converting hex into ASCII, but every solution I found only converts short expressions. Maybe someone can give me some advice? I'd be really grateful.
A more general solution to the problem:
// Extension method: this needs to live in a static class.
static String SeparateBy(
    this string str,
    string separator,
    int groupLength)
{
    var buffer = new StringBuilder();
    for (var i = 0; i < str.Length; i++)
    {
        // insert the separator before every group except the first
        if (i % groupLength == 0 && i > 0)
        {
            buffer.Append(separator);
        }
        buffer.Append(str[i]);
    }
    return buffer.ToString();
}
And you'd call it like: "7900BD7400BD7500BD76A5FF".SeparateBy(" ", 2)
When possible, and if it's relatively easy, try to generalize methods so they can be reused more often. Of course, if things start to get complicated, generalizing can be self-defeating; knowing when to generalize (and when not to) is a skill you will acquire little by little.
Since you don't seem to have much experience with string processing, I'll give an example that does not require you to learn too many things at once:
string input = "7900BD7400BD7500BD76A5FF";
string output = string.Empty;
for (int i = 0; i < input.Length; i += 2) // go in steps of 2
{
    output += input[i];     // the first of the 2 characters
    output += input[i + 1]; // the second of the 2 characters
    output += ' ';          // the space
}
Console.WriteLine(output);
Please note that this solution only inserts spaces after every second character. It does not check whether the characters are valid hex digits or whether the string's length is a multiple of 2; it assumes that whatever code runs before it produces a valid result. You should ensure that with unit tests.
This approach will not be very efficient for long strings (if you have long text, please learn about StringBuilder).
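If you do run into long strings, a minimal StringBuilder version of the same loop (my sketch, same assumptions about valid, even-length input) could look like this:
string input = "7900BD7400BD7500BD76A5FF";
var sb = new StringBuilder(input.Length + input.Length / 2); // room for the separators
for (int i = 0; i < input.Length; i += 2)
{
    sb.Append(input[i]);     // first character of the pair
    sb.Append(input[i + 1]); // second character of the pair
    sb.Append(' ');          // separator
}
Console.WriteLine(sb.ToString());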
If you followed this advice for creating the hex data, then it's of course much easier to insert the space right away:
public static string ByteArrayToString(byte[] ba)
{
    StringBuilder hex = new StringBuilder(ba.Length * 2);
    foreach (byte b in ba)
        hex.AppendFormat("{0:X2} ", b); // <-- I inserted a space in the format string
    return hex.ToString();
}

How to convert a base-10 number to base 3 in .NET (special case)

I'm looking for a routine in C# that gives me the following output when putting in numbers:
0 - A
1 - B
2 - C
3 - AA
4 - AB
5 - AC
6 - BA
7 - BB
8 - BC
9 - CA
10 - CB
11 - CC
12 - AAA
13 - etc
I'm using letters so that it's not so confusing with zeros.
I've seen other routines, but they will give me BA for the value of 3 and not AA.
Note: The other routines I found were Quickest way to convert a base 10 number to any base in .NET? and http://www.drdobbs.com/architecture-and-design/convert-from-long-to-any-number-base-in/228701167, but as I said, they don't give me exactly what I was looking for.
Converting between number systems is a basic programming task, and the logic doesn't differ from other systems (such as hexadecimal or binary). Please find the code below:
// here you choose which base to convert to; you wanted 3, so I assigned that value here
int systemNumber = 3;
// pick a number to convert (you can feed a text box value here)
int numberToParse = 5;
// Note below
numberToParse++;
string convertedNumber = "";
List<char> letters = new List<char> { 'A', 'B', 'C' };
// basic algorithm for converting numbers between systems
while (numberToParse > 0)
{
    // Note below
    numberToParse--;
    // append the corresponding letter to our "number"
    convertedNumber = letters[numberToParse % systemNumber] + convertedNumber;
    numberToParse = (int)Math.Floor((decimal)numberToParse / systemNumber);
}
// show the converted number
MessageBox.Show(convertedNumber);
NOTE: I didn't read carefully at first and got it wrong. I added two lines to the previous solution, marked with "Note below": an increment and a decrement of the parsed number. The decrement enables A (which is zero, and thus would be omitted at the beginning of a number) to be treated as a relevant leading digit. But this way, the numbers that can be converted are shifted and begin at 1. To compensate, we increment the number at the beginning.
Additionally, if you want to use other bases like this, you have to expand the list of letters. Right now we have A, B and C, because you wanted base 3. In fact, you can always use the full alphabet:
List<char> letters = new List<char> {'A','B','C', 'D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z'};
and only change systemNumber.
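As an aside (my own sketch, not part of the original answer), the same algorithm can be wrapped into a reusable method, where the letter list determines the base:
// Sketch of the answer's algorithm as a reusable method.
// 'letters' supplies the digits and its length is the base (e.g. "ABC" for base 3).
static string ToLetterBase(int value, string letters)
{
    int systemNumber = letters.Length;
    string converted = "";
    value++;                   // shift so that 0 maps to "A"
    while (value > 0)
    {
        value--;               // lets 'A' act as a leading digit
        converted = letters[value % systemNumber] + converted;
        value /= systemNumber;
    }
    return converted;
}
// ToLetterBase(3, "ABC")  -> "AA"
// ToLetterBase(12, "ABC") -> "AAA"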
Based on code from https://stackoverflow.com/a/182924 the following should work:
private string GetWeirdBase3Value(int input)
{
    int dividend = input + 1;
    string output = String.Empty;
    int modulo;
    while (dividend > 0)
    {
        modulo = (dividend - 1) % 3;
        output = Convert.ToChar('A' + modulo).ToString() + output;
        dividend = (int)((dividend - modulo) / 3);
    }
    return output;
}
The code should hopefully be pretty easy to read. It essentially iteratively calculates character by character until the dividend is reduced to 0.
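A quick sanity check against the table in the question (my own snippet, assuming GetWeirdBase3Value is in scope) might be:
for (int i = 0; i <= 12; i++)
{
    Console.WriteLine($"{i} - {GetWeirdBase3Value(i)}");
}
// prints: 0 - A, 1 - B, 2 - C, 3 - AA, ... 11 - CC, 12 - AAA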

adler-32 algorithm example Clarity would be great

I am having a hard time understanding what exactly is going on behind this algorithm. I have the following code, which I believe works for the Wikipedia example. I seem to be having problems matching up the correct hex values: while I get the correct hex value for the wiki example, it seems that my int finalValue is not the correct value.
string fText, fileName, output;
Int32 a = 1, b = 0;
const int MOD_ADLER = 65521;
const int ADLER_CONST2 = 65536;

private void btnCalculate_Click(object sender, EventArgs e)
{
    fileName = tbFilePath.Text;
    if (fileName != "" && File.Exists(fileName))
    {
        fText = File.ReadAllText(fileName);
        foreach (char i in fText)
        {
            a = (a + Convert.ToInt32(i)) % MOD_ADLER;
            b = (b + a) % MOD_ADLER;
        }
        int finalValue = (b * ADLER_CONST2 + a);
        output = finalValue.ToString("X");
        lbValue.Text = output.ToString();
    }
    else
    {
        MessageBox.Show("This is not a valid filepath, or is a blank file.\n" +
            "Please enter a valid file path.");
    }
}
I understand that this is not an efficient way to go about it; I am just trying to understand what is really going on under the hood, so that I can create a more efficient algorithm that varies from this.
From my understanding, in my code the value a starts at its initial value of 1 and has each character's integer (32-bit) value added to it. I take the modulus by the very high prime number and continue moving through the text of my file, adding up the values until all of the characters have been accumulated.
Probably these two lines confuse you:
a = ( a + Convert.ToInt32(i)) % MOD_ADLER;
b = (b + a) % MOD_ADLER;
Every char has an integer representation. You can check this article. You are setting a to the remainder of (the current value of a + the integer representation of the char) divided by MOD_ADLER. You can read up on the % operator.
What a remainder is: 5 % 2 = 1.
After that you do the same thing for b: b becomes the remainder of (the current value of b + a) divided by MOD_ADLER. After you do that multiple times (once per char in the string), you have this:
int finalValue = (b * ADLER_CONST2 + a);
output = finalValue.ToString("X");
This converts the final integer value to HEX.
output = finalValue.ToString("X");
The "X" format says generate the hexadecimal represent of the number!
See MSDN Standard Numeric Format Strings
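To see the whole thing end to end, here is a minimal sketch (my own, not from the answer) that computes the checksum of the string "Wikipedia" used as the example in the algorithm's Wikipedia article; it should print 11E60398:
const int MOD_ADLER = 65521;
string text = "Wikipedia";     // ASCII, so each char value equals its byte value
int a = 1, b = 0;
foreach (char c in text)
{
    a = (a + c) % MOD_ADLER;   // running sum of the character values, seeded with 1
    b = (b + a) % MOD_ADLER;   // running sum of the running sums
}
int finalValue = b * 65536 + a; // high 16 bits = b, low 16 bits = a
Console.WriteLine(finalValue.ToString("X")); // 11E60398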

Class Vs Pure Array Representation

We need to represent huge numbers in our application. We're doing this using integer arrays. The final product should be tuned for maximum performance. We were thinking about encapsulating our array in a class so we could add properties related to the array, such as isNegative, numberBase and the like.
We're afraid, however, that using classes will kill us performance-wise. We ran a test where we created a fixed number of arrays and set their values, once through plain array usage and once through a class that wraps the array:
for (int i = 0; i < 10000; i++)
{
    if (createClass)
    {
        BigNumber b = new BigNumber(new int[5000], 10);
        for (int j = 0; j < b.Number.Length; j++)
        {
            b[j] = 5;
        }
    }
    else
    {
        int[] test = new int[5000];
        for (int j = 0; j < test.Length; j++)
        {
            test[j] = 5;
        }
    }
}
And it appears that using the class slows down the running time of the above code by a factor of almost 6. We tried the same test with the array encapsulated in a struct instead, which made the running time almost equal to pure array usage.
What is causing this huge overhead when using classes compared to structs? Is it really just the performance gain you get from using the stack instead of the heap?
BigNumber just stores the array in a private variable exposed by a property. Simplified:
public class BigNumber
{
    private int[] number;
    public BigNumber(int[] number) { this.number = number; }
    public int[] Number { get { return number; } }
}
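(The loop in the question assigns through b[j], so presumably BigNumber also exposes an indexer, which the simplified version omits; as an assumption, it would be roughly:)
// Hypothetical indexer, not shown in the question's simplified class.
public int this[int index]
{
    get { return number[index]; }
    set { number[index] = value; }
}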
It's not surprising that the second loop is much faster than the first one. What's happening is not that the class is extraordinarily slow, it's that the loop is really easy for the compiler to optimize.
As the loop ranges from 0 to test.Length-1, the compiler can tell that the index variable can never be outside of the array, so it can remove the range check when accessing the array by index.
In the first loop the compiler can't do the connection between the loop and the array, so it has to check the index against the boundaries for each item that is accessed.
There will always be a bit of overhead when you encapsulate an array inside a class, but it's not as much as the difference you get in your test. You have chosen a situation where the compiler is able to optimize the plain array access very well, so what you are testing is more the compiler's ability to optimize the code than what you set out to test.
You should profile the code when you run it and see where the time is being spent.
Also consider another language that makes it easy to use big ints.
You're using an integer data type to store a single digit, which is part of a really large number. This is wrong.
The numerals 0-9 can be represented in 4 bits. A byte contains 8 bits. So, you can stuff 2 digits into a single byte (there's your first speed up hint).
Now, go look up how many bytes an integer is taking up (hint: it will be way more than you need to store a single digit).
What's killing performance is the use of integers, which is consuming about 4 times as much memory as you should be. Use bytes or, worst case, a character array (2 digits per byte or character) to store the numerals. It doesn't take a whole lot of logic to "pack" and "unpack" the numerals into a byte.
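As a small illustration of that packing idea (my own sketch, not the poster's code), packing and unpacking two decimal digits boils down to a shift and a mask:
byte hi = 3, lo = 7;
byte packed = (byte)((hi << 4) | lo);   // 0x37 == 55: high nibble 3, low nibble 7
byte hiBack = (byte)(packed >> 4);      // 3
byte loBack = (byte)(packed & 0x0F);    // 7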
On the face of it, I would not expect a big difference. Certainly not a factor of 6. BigNumber is just a class around an int[], isn't it? It would help if you showed us a little of BigNumber. And check your benchmarking.
It would be ideal if you posted something small we could copy/paste and run.
Without seeing your BigInteger implementation, it is very difficult to tell. However, I have two guesses.
1) Your loop line over the plain test array can get special handling by the JIT, which removes the array bounds checking. This can give you a significant boost, especially since you're not doing any "real work" in the loop:
for (int j = 0; j < test.Length; j++) // This removes bounds checking by JIT
2) Are you timing this in release mode, outside of Visual Studio? If not, that alone would explain your 6x slowdown, since the Visual Studio hosting process slows down class access artificially. Make sure you're in release mode and use Ctrl+F5 to test your timings.
Rather than reinventing (and debugging and perfecting) the wheel, you might be better served using an existing big integer implementation, so you can get on with the rest of your project.
This SO topic is a good start.
You might also check out this CodeProject article.
As pointed out by Guffa, the difference is mostly bounds checking.
To guarantee that bounds checking will not ruin performance, you can also put your tight loops in an unsafe block and this will eliminate bounds checking. To do this you'll need to compile with the /unsafe option.
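A minimal sketch of that idea (my own, assuming the project is compiled with /unsafe):
int[] test = new int[5000];
unsafe
{
    fixed (int* p = test)       // pin the array so the pointer stays valid
    {
        for (int j = 0; j < test.Length; j++)
        {
            p[j] = 5;           // pointer access, no array bounds check
        }
    }
}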
//pre-load the bits -- do this only ONCE
byte[] baHi = new byte[16];
baHi[0]=0;
baHi[1] = 000 + 00 + 00 + 16; //0001
baHi[2] = 000 + 00 + 32 + 00; //0010
baHi[3] = 000 + 00 + 32 + 16; //0011
baHi[4] = 000 + 64 + 00 + 00; //0100
baHi[5] = 000 + 64 + 00 + 16; //0101
baHi[6] = 000 + 64 + 32 + 00; //0110
baHi[7] = 000 + 64 + 32 + 16; //0111
baHi[8] = 128 + 00 + 00 + 00; //1000
baHi[9] = 128 + 00 + 00 + 16; //1001
//not needed for 0-9
//baHi[10] = 128 + 00 + 32 + 00; //1010
//baHi[11] = 128 + 00 + 32 + 16; //1011
//baHi[12] = 128 + 64 + 00 + 00; //1100
//baHi[13] = 128 + 64 + 00 + 16; //1101
//baHi[14] = 128 + 64 + 32 + 00; //1110
//baHi[15] = 128 + 64 + 32 + 16; //1111
//-------------------------------------------------------------------------
//START PACKING
//load TWO digits (0-9) at a time
//this means if you're loading a big number from
//a file, you read two digits at a time
//and put them into bLoVal and bHiVal
//230942034371231235 see that '37' in the middle?
// ^^
//
byte bHiVal = 3; //0000 0011
byte bLoVal = 7; //0000 0111
byte bShiftedLeftHiVal = (byte)baHi[bHiVal]; //0011 0000 = 3 shifted left (48)
//fuse the two together into a single byte
byte bNewVal = (byte)(bShiftedLeftHiVal + bLoVal); //0011 0111 = 55 decimal
//now store bNewVal wherever you want to store it
//for later retrieval, like a byte array
//END PACKING
//-------------------------------------------------------------------------
Response.Write("PACKING: hi: " + bHiVal + " lo: " + bLoVal + " packed: " + bNewVal);
Response.Write("<br>");
//-------------------------------------------------------------------------
//START UNPACKING
byte bUnpackedLoByte = (byte)(bNewVal & 15); //will yield 7
byte bUnpackedHiByte = (byte)(bNewVal & 240); //will yield 48
//now we need to change '48' back into '3'
string sHiBits = "00000000" + Convert.ToString(bUnpackedHiByte, 2); //drops leading 0s, so we pad...
sHiBits = sHiBits.Substring(sHiBits.Length - 8, 8); //and get the last 8 characters
sHiBits = ("0000" + sHiBits).Substring(0, 8); //shift right
bUnpackedHiByte = (byte)Convert.ToSByte(sHiBits, 2); //and, finally, get back the original byte
//the above method, reworked, could also be used to PACK the data,
//though it might be slower than hitting an array.
//You can also loop through baHi to unpack, comparing the original
//bUnpackedHyByte to the contents of the array and return
//the index of where you found it (the index would be the
//unpacked digit)
Response.Write("UNPACKING: input: " + bNewVal + " hi: " + bUnpackedHiByte + " lo: " + bUnpackedLoByte);
//now create your output with bUnpackedHiByte and bUnpackedLoByte,
//then move on to the next two bytes in where ever you stored the
//really big number
//END UNPACKING
//-------------------------------------------------------------------------
Even if you just changed your INT to SHORT in your original solution, you'd cut your memory requirements in half. The approach above takes memory down to almost the bare minimum (I'm sure someone will come along screaming about a few wasted bytes).

Best way to shorten UTF8 string based on byte length

A recent project called for importing data into an Oracle database. The program that will do this is a C# .Net 3.5 app and I'm using the Oracle.DataAccess connection library to handle the actual inserting.
I ran into a problem where I'd receive this error message when inserting a particular field:
ORA-12899 Value too large for column X
I used Field.Substring(0, MaxLength); but still got the error (though not for every record).
Finally I saw what should have been obvious: my string was in ANSI but the field was UTF8, and its length is defined in bytes, not characters.
This gets me to my question. What is the best way to trim my string to fit the MaxLength?
My substring code works by character length. Is there a simple C# function that can intelligently trim a UTF8 string by byte length (i.e., not hack off half a character)?
I think we can do better than naively counting the total length of a string with each addition. LINQ is cool, but it can accidentally encourage inefficient code. What if I wanted the first 80,000 bytes of a giant UTF string? That's a lot of unnecessary counting. "I've got 1 byte. Now I've got 2. Now I've got 13... Now I have 52,384..."
That's silly. Most of the time, at least in l'anglais, we can cut exactly on that nth byte. Even in another language, we're less than 6 bytes away from a good cutting point.
So I'm going to start from @Oren's suggestion, which is to key off of the leading bit of a UTF8 char value. Let's start by cutting right at the (n+1)th byte, and use Oren's trick to figure out if we need to cut a few bytes earlier.
Three possibilities
If the first byte after the cut has a 0 in the leading bit, I know I'm cutting precisely before a single byte (conventional ASCII) character, and can cut cleanly.
If I have a 11 following the cut, the next byte after the cut is the start of a multi-byte character, so that's a good place to cut too!
If I have a 10, however, I know I'm in the middle of a multi-byte character, and need to go back to check to see where it really starts.
That is, though I want to cut the string after the nth byte, if that n+1th byte comes in the middle of a multi-byte character, cutting would create an invalid UTF8 value. I need to back up until I get to one that starts with 11 and cut just before it.
Code
Notes: I'm using stuff like Convert.ToByte("11000000", 2) so that it's easy to tell what bits I'm masking (a little more about bit masking here). In a nutshell, I'm &ing to return what's in the byte's first two bits and bringing back 0s for the rest. Then I check the XX from XX000000 to see if it's 10 or 11, where appropriate.
I found out today that C# 6.0 might actually support binary representations, which is cool, but we'll keep using this kludge for now to illustrate what's going on.
The PadLeft is just because I'm overly OCD about output to the Console.
So here's a function that will cut you down to a string that's n bytes long, or the greatest length less than n that ends with a "complete" UTF8 character.
public static string CutToUTF8Length(string str, int byteLength)
{
    byte[] byteArray = Encoding.UTF8.GetBytes(str);
    string returnValue = string.Empty;

    if (byteArray.Length > byteLength)
    {
        int bytePointer = byteLength;

        // Check high bit to see if we're [potentially] in the middle of a multi-byte char
        if (bytePointer >= 0
            && (byteArray[bytePointer] & Convert.ToByte("10000000", 2)) > 0)
        {
            // If so, keep walking back until we have a byte starting with `11`,
            // which means the first byte of a multi-byte UTF8 character.
            while (bytePointer >= 0
                && Convert.ToByte("11000000", 2) != (byteArray[bytePointer] & Convert.ToByte("11000000", 2)))
            {
                bytePointer--;
            }
        }

        // See if we had 1s in the high bit all the way back. If so, we're toast. Return empty string.
        if (bytePointer > 0)
        {
            returnValue = Encoding.UTF8.GetString(byteArray, 0, bytePointer); // hat tip to @NealEhardt! Well played. ;^)
        }
    }
    else
    {
        returnValue = str;
    }

    return returnValue;
}
I initially wrote this as a string extension. Just add back the this before string str to put it back into extension format, of course. I removed the this so that we could just slap the method into Program.cs in a simple console app to demonstrate.
Test and expected output
Here's a good test case, with the output it creates below, written expecting to be the Main method of a simple console app's Program.cs.
static void Main(string[] args)
{
    string testValue = "12345“”67890”";

    for (int i = 0; i < 15; i++)
    {
        string cutValue = Program.CutToUTF8Length(testValue, i);
        Console.WriteLine(i.ToString().PadLeft(2) +
            ": " + Encoding.UTF8.GetByteCount(cutValue).ToString().PadLeft(2) +
            ":: " + cutValue);
    }

    Console.WriteLine();
    Console.WriteLine();

    foreach (byte b in Encoding.UTF8.GetBytes(testValue))
    {
        Console.WriteLine(b.ToString().PadLeft(3) + " " + (char)b);
    }

    Console.WriteLine("Return to end.");
    Console.ReadLine();
}
Output follows. Notice that the "smart quotes" in testValue are three bytes long in UTF8 (though when we write the chars to the console in ASCII, it outputs dumb quotes). Also note the ?s output for the second and third bytes of each smart quote in the output.
The first five characters of our testValue are single bytes in UTF8, so the 0-5 byte values should be 0-5 characters. Then we have a three-byte smart quote, which can't be included in its entirety until 5 + 3 bytes. Sure enough, we see that pop out at the call for 8. Our next smart quote pops out at 8 + 3 = 11, and then we're back to single-byte characters through 14.
0: 0::
1: 1:: 1
2: 2:: 12
3: 3:: 123
4: 4:: 1234
5: 5:: 12345
6: 5:: 12345
7: 5:: 12345
8: 8:: 12345"
9: 8:: 12345"
10: 8:: 12345"
11: 11:: 12345""
12: 12:: 12345""6
13: 13:: 12345""67
14: 14:: 12345""678
49 1
50 2
51 3
52 4
53 5
226 â
128 ?
156 ?
226 â
128 ?
157 ?
54 6
55 7
56 8
57 9
48 0
226 â
128 ?
157 ?
Return to end.
So that's kind of fun, and I'm in just before the question's five year anniversary. Though Oren's description of the bits had a small error, that's exactly the trick you want to use. Thanks for the question; neat.
Here are two possible solutions: a LINQ one-liner processing the input left to right, and a traditional for-loop processing the input from right to left. Which processing direction is faster depends on the string length, the allowed byte length, and the number and distribution of multibyte characters, so it is hard to give a general recommendation. The decision between LINQ and traditional code is probably a matter of taste (or maybe speed).
If speed matters, one could think about just accumulating the byte length of each character until reaching the maximum length, instead of calculating the byte length of the whole string in each iteration. But I am not sure this will work, because I don't know UTF-8 encoding well enough; I could theoretically imagine that the byte length of a string does not equal the sum of the byte lengths of all its characters.
public static String LimitByteLength(String input, Int32 maxLength)
{
    return new String(input
        .TakeWhile((c, i) =>
            Encoding.UTF8.GetByteCount(input.Substring(0, i + 1)) <= maxLength)
        .ToArray());
}

public static String LimitByteLength2(String input, Int32 maxLength)
{
    for (Int32 i = input.Length - 1; i >= 0; i--)
    {
        if (Encoding.UTF8.GetByteCount(input.Substring(0, i + 1)) <= maxLength)
        {
            return input.Substring(0, i + 1);
        }
    }
    return String.Empty;
}
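As a quick usage check (my own snippet, reusing the smart-quote test string from the earlier answer), both variants should agree:
string s = "12345“”67890”";
Console.WriteLine(LimitByteLength(s, 7));   // "12345"  (the 3-byte “ does not fit)
Console.WriteLine(LimitByteLength2(s, 7));  // "12345"
Console.WriteLine(LimitByteLength(s, 8));   // "12345“"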
Shorter version of ruffin's answer. Takes advantage of the design of UTF8:
public static string LimitUtf8ByteCount(this string s, int n)
{
    // quick test (we probably won't be trimming most of the time)
    if (Encoding.UTF8.GetByteCount(s) <= n)
        return s;
    // get the bytes
    var a = Encoding.UTF8.GetBytes(s);
    // if we are in the middle of a character (highest two bits are 10)
    if (n > 0 && (a[n] & 0xC0) == 0x80)
    {
        // remove all bytes whose two highest bits are 10
        // and one more (start of multi-byte sequence - highest bits should be 11)
        while (--n > 0 && (a[n] & 0xC0) == 0x80)
            ;
    }
    // convert back to string (with the limit adjusted)
    return Encoding.UTF8.GetString(a, 0, n);
}
All of the other answers appear to miss the fact that this functionality is already built into .NET, in the Encoder class. For bonus points, this approach will also work for other encodings.
public static string LimitByteLength(string message, int maxLength)
{
    if (string.IsNullOrEmpty(message) || Encoding.UTF8.GetByteCount(message) <= maxLength)
    {
        return message;
    }

    var encoder = Encoding.UTF8.GetEncoder();
    byte[] buffer = new byte[maxLength];
    char[] messageChars = message.ToCharArray();

    encoder.Convert(
        chars: messageChars,
        charIndex: 0,
        charCount: messageChars.Length,
        bytes: buffer,
        byteIndex: 0,
        byteCount: buffer.Length,
        flush: false,
        charsUsed: out int charsUsed,
        bytesUsed: out int bytesUsed,
        completed: out bool completed);

    // I don't think we can return message.Substring(0, charsUsed)
    // as that's the number of UTF-16 chars, not the number of codepoints
    // (think about surrogate pairs). Therefore I think we need to
    // actually convert bytes back into a new string
    return Encoding.UTF8.GetString(buffer, 0, bytesUsed);
}
If you're using .NET Standard 2.1+, you can simplify it a bit:
public static string LimitByteLength(string message, int maxLength)
{
    if (string.IsNullOrEmpty(message) || Encoding.UTF8.GetByteCount(message) <= maxLength)
    {
        return message;
    }

    var encoder = Encoding.UTF8.GetEncoder();
    byte[] buffer = new byte[maxLength];
    encoder.Convert(message.AsSpan(), buffer.AsSpan(), false, out _, out int bytesUsed, out _);
    return Encoding.UTF8.GetString(buffer, 0, bytesUsed);
}
None of the other answers account for extended grapheme clusters, such as 👩🏽‍🚒. This is composed of 4 Unicode scalars (👩, 🏽, a zero-width joiner, and 🚒), so you need knowledge of the Unicode standard to avoid splitting it in the middle and producing 👩 or 👩🏽.
In .NET 5 onwards, you can write this as:
public static string LimitByteLength(string message, int maxLength)
{
    if (string.IsNullOrEmpty(message) || Encoding.UTF8.GetByteCount(message) <= maxLength)
    {
        return message;
    }

    var enumerator = StringInfo.GetTextElementEnumerator(message);
    var result = new StringBuilder();
    int lengthBytes = 0;
    while (enumerator.MoveNext())
    {
        lengthBytes += Encoding.UTF8.GetByteCount(enumerator.GetTextElement());
        if (lengthBytes <= maxLength)
        {
            result.Append(enumerator.GetTextElement());
        }
    }

    return result.ToString();
}
(This same code runs on earlier versions of .NET, but due to a bug it won't produce the correct result before .NET 5).
If a UTF-8 byte has a zero-valued high order bit, it's the beginning of a character. If its high order bit is 1, it's in the 'middle' of a character. The ability to detect the beginning of a character was an explicit design goal of UTF-8.
Check out the Description section of the wikipedia article for more detail.
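A tiny sketch of that byte classification (my own, with the 10xxxxxx/11xxxxxx refinement mentioned in the earlier answer):
static string ClassifyUtf8Byte(byte b)
{
    if ((b & 0x80) == 0x00) return "single-byte (ASCII) character";        // 0xxxxxxx
    if ((b & 0xC0) == 0xC0) return "lead byte of a multi-byte character";  // 11xxxxxx
    return "continuation byte";                                            // 10xxxxxx
}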
Is there a reason that you need the database column to be declared in terms of bytes? That's the default, but it's not a particularly useful default if the database character set is variable width. I'd strongly prefer declaring the column in terms of characters.
CREATE TABLE length_example (
col1 VARCHAR2( 10 BYTE ),
col2 VARCHAR2( 10 CHAR )
);
This will create a table where COL1 will store 10 bytes of data and col2 will store 10 characters worth of data. Character length semantics make far more sense in a UTF8 database.
Assuming you want all the tables you create to use character length semantics by default, you can set the initialization parameter NLS_LENGTH_SEMANTICS to CHAR. At that point, any tables you create will default to using character length semantics rather than byte length semantics if you don't specify CHAR or BYTE in the field length.
Following Oren Trutner's comment, here are two more solutions to the problem:
Here we count the number of bytes to remove from the end of the string, based on each character at the end of the string, so we don't evaluate the entire string in every iteration.
string str = "朣楢琴执执 瑩浻牡楧硰执执獧浻牡楧敬瑦 瀰 絸朣杢执獧扻捡杫潲湵 潣";
int maxBytesLength = 30;
var bytesArr = Encoding.UTF8.GetBytes(str);
int bytesToRemove = 0;
int lastIndexInString = str.Length - 1;
while (bytesArr.Length - bytesToRemove > maxBytesLength)
{
    bytesToRemove += Encoding.UTF8.GetByteCount(new char[] { str[lastIndexInString] });
    --lastIndexInString;
}
string trimmedString = Encoding.UTF8.GetString(bytesArr, 0, bytesArr.Length - bytesToRemove);
//Encoding.UTF8.GetByteCount(trimmedString); // get the actual length, will be <= maxBytesLength
And an even more efficient (and maintainable) solution:
Get the string from the byte array according to the desired length and cut the last character, because it might be corrupted:
string str = "朣楢琴执执 瑩浻牡楧硰执执獧浻牡楧敬瑦 瀰 絸朣杢执獧扻捡杫潲湵 潣";
int maxBytesLength = 30;
string trimmedWithDirtyLastChar = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(str), 0, maxBytesLength);
string trimmedString = trimmedWithDirtyLastChar.Substring(0, trimmedWithDirtyLastChar.Length - 1);
The only downside of the second solution is that we might cut a perfectly fine last character, but since we are already cutting the string, that may still fit the requirements.
Thanks to Shhade, who thought of the second solution.
This is another solution based on binary search:
public string LimitToUTF8ByteLength(string text, int size)
{
    if (size <= 0)
    {
        return string.Empty;
    }

    int maxLength = text.Length;
    int minLength = 0;
    int length = maxLength;

    while (maxLength >= minLength)
    {
        length = (maxLength + minLength) / 2;
        int byteLength = Encoding.UTF8.GetByteCount(text.Substring(0, length));

        if (byteLength > size)
        {
            maxLength = length - 1;
        }
        else if (byteLength < size)
        {
            minLength = length + 1;
        }
        else
        {
            return text.Substring(0, length);
        }
    }

    // Round down the result
    string result = text.Substring(0, length);
    if (size >= Encoding.UTF8.GetByteCount(result))
    {
        return result;
    }
    else
    {
        return text.Substring(0, length - 1);
    }
}
public static string LimitByteLength3(string input, Int32 maxLength)
{
    string result = input;
    int byteCount = Encoding.UTF8.GetByteCount(input);
    if (byteCount > maxLength)
    {
        var byteArray = Encoding.UTF8.GetBytes(input);
        // Note: cutting at an arbitrary byte offset can split a multi-byte character,
        // in which case GetString substitutes a replacement character at the end.
        result = Encoding.UTF8.GetString(byteArray, 0, maxLength);
    }
    return result;
}
