How to divide long text into bytes c#

I'm still a real newbie at coding, and my problem is how to divide a string like "7900BD7400BD7500BD76A5FF" into "79 00 BD 74 00 BD 75 00 BD 76 A5 FF". My main problem was converting hex into ASCII, but every solution I found only converts "short" expressions. Maybe someone can give me some advice? I'd be really grateful.

A more general solution to the problem:
static String SeparateBy(
this string str,
string separator,
int groupLength)
{
var buffer = new StringBuilder();
for (var i = 0; i < str.Length; i++)
{
if (i > 0 && i % groupLength == 0) // skip i == 0 so the result has no leading separator
{
buffer.Append(separator);
}
buffer.Append(str[i]);
}
return buffer.ToString();
}
And you'd call it like: "7900BD7400BD7500BD76A5FF".SeparateBy(" ", 2)
When possible, and if it's relatively easy, try to generalize methods so they can be reused more often. Of course, if things start to get complicated, generalizing can be self-defeating… knowing when or when not to generalize is a skill you will acquire little by little.

Since you don't seem to have much experience with string processing, I'll give an example that does not require you to learn too many things at once:
string input = "7900BD7400BD7500BD76A5FF";
string output = string.Empty;
for(int i=0; i<input.Length; i+=2) // Go in steps of 2
{
output += input[i]; // The first of 2 characters
output += input[i+1]; // The second of 2 characters
output += ' '; // The space
}
Console.WriteLine(output);
Please note that this solution only inserts spaces after every second character. It does not check whether these are all hex codes and whether its length is a multiple of 2. It assumes that whatever code is before generates a valid result. You should ensure that with unit tests.
This approach will not be very efficient for long strings (if you have long text, please learn about StringBuilder).
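For longer inputs, the same loop can be written with a StringBuilder. This is only a minimal sketch: the class and method names (HexFormatter, InsertSpaces) are made up for illustration, and unlike the loop above it rejects odd-length input and does not leave a trailing space.

```csharp
using System;
using System.Text;

static class HexFormatter
{
    // Hypothetical helper: inserts a space between every pair of characters.
    public static string InsertSpaces(string hex)
    {
        if (hex == null)
            throw new ArgumentNullException(nameof(hex));
        if (hex.Length % 2 != 0)
            throw new ArgumentException("Hex string must have an even number of characters.", nameof(hex));

        // Two characters plus one separator per byte.
        var sb = new StringBuilder(hex.Length + hex.Length / 2);
        for (int i = 0; i < hex.Length; i += 2)
        {
            if (i > 0)
                sb.Append(' ');      // separator between groups only
            sb.Append(hex, i, 2);    // copy the next two characters
        }
        return sb.ToString();
    }
}
```

You would call it like HexFormatter.InsertSpaces("7900BD7400BD7500BD76A5FF"). It still does not check that the characters are valid hex digits.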
If you followed this advice for creating the hex data, then it's of course much easier to insert the space right away:
public static string ByteArrayToString(byte[] ba)
{
StringBuilder hex = new StringBuilder(ba.Length * 2);
foreach (byte b in ba)
hex.AppendFormat("{0:X2} ", b); // <-- I inserted a space in the format string
return hex.ToString();
}

Related

Reading text file with repeating pattern c#

I want to read from a text file a hex number, using the last digit to define my length to read a string, again a number and so on until the line will finish.
using (StreamReader sr = new StreamReader(fileName)){
String line = sr.ReadLine();
string hexText = line.Substring(0,9);
char c = hexText[8];
int con = c - '0'; //saving the value
string myHex = con.ToString("X");
int length = Convert.ToInt32(myHex, 16);
string fieldChars = line.Substring(0, length); //getting the key
string b = line.Substring(c, length); }
so for "5A3F00004olga" the length is correct, 4 (the last hex digit), but for some reason b is not "olga". Why?
Let's take a closer look at your code. You have:
char c = hexText[8];
int con = c - '0'; //saving the value
string myHex = con.ToString("X");
int length = Convert.ToInt32(myHex, 16);
string fieldChars = line.Substring(0, length); //getting the key
string b = line.Substring(c, length);
So c contains the value of the character at position 8. I don't have any idea why you're subtracting '0' from it, and then converting the result back to a string. You could just as easily write:
string myHex = c.ToString();
Also, if the value at hexText[8] were 'A', then subtracting '0' would give you 17 rather than the 10 that you expected.
I also don't know what you expect the line that assigns fieldChars to do, but I can pretty much guarantee that it's not doing what you want.
The reason b doesn't contain "olga" is that the substring starting position in this case would be the character code of '4' (52 in decimal) rather than 4, and length could be totally wrong as described above.
I think the code you want is:
char c = hexText[8];
string myHex = c.ToString();
int length = Convert.ToInt32(myHex, 16);
string b = line.Substring(9, length);
I didn't include fieldChars there because it's unclear to me what value you want that variable to hold. If you want the key, then you'd want line.Substring(0, 8). If you want the entire field, including the key, the length, and the text, you'd write line.Substring(0, 9 + length).
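Since the question mentions repeating this until the line is finished, here is a rough sketch of such a loop. It assumes every field has the same layout as the example (8 hex digits of key, 1 hex digit of length, then that many characters of text); the names FieldReader and ReadFields are invented for illustration.

```csharp
using System;
using System.Collections.Generic;

static class FieldReader
{
    // Assumed layout per field: 8-char key, 1 hex digit length, then the text.
    public static List<(string Key, string Text)> ReadFields(string line)
    {
        var fields = new List<(string Key, string Text)>();
        int pos = 0;
        while (pos + 9 <= line.Length)
        {
            string key = line.Substring(pos, 8);
            // Parse the single hex digit (0-9, A-F) as a number from 0 to 15.
            int length = Convert.ToInt32(line[pos + 8].ToString(), 16);
            if (pos + 9 + length > line.Length)
                throw new FormatException("Field text runs past the end of the line.");
            fields.Add((key, line.Substring(pos + 9, length)));
            pos += 9 + length; // move to the start of the next field
        }
        return fields;
    }
}
```

For "5A3F00004olga" this should yield one field with key "5A3F0000" and text "olga". Anything left over that is shorter than 9 characters is silently ignored here; you may prefer to treat that as an error.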
As suggested in comments, you can use the C# debugger to walk through your code one line at a time, and inspect variables to see what is happening. That's an excellent way to see the results of the code you write. Better yet is to work a few examples by hand before you start writing code. That way you get a better understanding of what the code is supposed to do. It makes writing correct code a whole lot easier.
If you don't know how to use the debugger, now is the perfect time to learn.
Another way you can see what's happening is to put Console.WriteLine statements after each line. For example:
char c = hexText[8];
Console.WriteLine("c = " + c);
If you do that after each line, or after each important group of lines, displaying the values of important variables, you can see exactly where your program is going off into the weeds.

ToString("X") produces single digit hex numbers

We wrote a crude data scope.
(The freeware terminal programs we found were unable to keep up with Bluetooth speeds)
The results are okay, and we are writing them to a Comma separated file for use with a spreadsheet. It would be better to see the hex values line up in nice columns in the RichTextBox instead of the way it looks now (Screen cap appended).
This is the routine that adds the digits (e.g., numbers from 0 to FF) to the text in the RichTextBox.
public void Write(byte[] b)
{
if (writting)
{
for (int i = 0; i < b.Length; i++)
{
storage[sPlace++] = b[i];
pass += b[i].ToString("X") + " "; //// <<<--- Here is the problem
if (sPlace % numericUpDown1.Value == 0)
{
pass += "\r\n";
}
}
}
}
I would like a way for the instruction pass += b[i].ToString("X") + " "; to produce a leading zero on values from 00h to 0Fh
Or, some other way to turn the value in byte b into two alphabetic characters from 00 to FF
Digits on left, FF 40 0 5 Line up nice and neatly, because they are identical. As soon as we encounter any difference in data, the columns vanish and the data become extremely difficult to read with human observation.
Use the X format specifier with a precision of 2, which pads the result to two hex digits:
pass += b[i].ToString("X2") + " ";
The documentation on MSDN, Standard Numeric Format Strings has examples.
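To show the effect on column alignment, here is a small standalone sketch. HexDump.Format and its bytesPerRow parameter are invented names (bytesPerRow stands in for numericUpDown1.Value in the question), and a StringBuilder replaces the string concatenation.

```csharp
using System;
using System.Text;

static class HexDump
{
    // Hypothetical helper: two hex digits per byte, a space after each byte,
    // and a line break after every bytesPerRow bytes.
    public static string Format(byte[] data, int bytesPerRow)
    {
        if (bytesPerRow <= 0)
            throw new ArgumentOutOfRangeException(nameof(bytesPerRow));

        var sb = new StringBuilder(data.Length * 3);
        for (int i = 0; i < data.Length; i++)
        {
            sb.Append(data[i].ToString("X2")).Append(' '); // 0x05 becomes "05"
            if ((i + 1) % bytesPerRow == 0)
                sb.Append("\r\n");
        }
        return sb.ToString();
    }
}
```

With a fixed-width font in the RichTextBox, every byte then takes exactly three columns.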

Byte/char buffer to long and/or double

In my code I need to convert string representations of integers to long and double values.
The string representation is a byte array (byte[]). For example, for the number 12345 the string representation is { 49, 50, 51, 52, 53 }.
Currently, I use the following obvious code for conversion to long (and almost the same code for conversion to double):
private long bytesToIntValue()
{
string s = System.Text.Encoding.GetEncoding("Latin1").GetString(bytes);
return long.Parse(s, CultureInfo.InvariantCulture);
}
This code works as expected, but in my case I want something better. It's because currently I must convert bytes to string first.
In my case, bytesToIntValue() gets called about 12 million times and about 25% of all memory allocations are made in this method.
Sure, I want to optimize this part. I want to perform conversions without intermediate string (+ speed, - allocation).
What would you recommend? How can I perform conversions without intermediate strings? Is there a faster method to perform conversions?
EDIT:
The byte arrays I am dealing with always contain ASCII-encoded data. Numbers can be negative. For double values exponential format is allowed. Hexadecimal integers are not allowed.
How can I perform conversions without intermediate strings?
Well you can easily convert each byte to a char. For example - untested:
private static long ConvertAsciiBytesToInt64(byte[] bytes)
{
long value = 0;
foreach (byte b in bytes)
{
value *= 10L;
char c = (char) b; // Explicit cast (C# has no implicit byte-to-char conversion); effectively ISO-8859-1
if (c < '0' || c > '9')
{
throw new ArgumentException("Bytes contains non-digit: " + c);
}
value += (c - '0');
}
return value;
}
Note that this really does assume it's ASCII (or compatible) - if your byte array is actually UTF-16 (for example) then it will definitely do the wrong thing.
Also note that this doesn't perform any sort of length validation or overflow checking... and it doesn't cope with negative numbers. You could add all of these if you want, but we don't know enough about your requirements to know if it's worth adding the complexity.
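Since the edit to the question says numbers can be negative, here is one way the sign and overflow handling might be added. Treat it as a sketch under assumptions: the names AsciiNumberParser and ParseInt64 are invented, only an optional leading '+' or '-' followed by decimal digits is accepted, and whitespace is not skipped.

```csharp
using System;

static class AsciiNumberParser
{
    // Hypothetical helper: parses ASCII digits with an optional sign, no intermediate string.
    public static long ParseInt64(byte[] bytes)
    {
        if (bytes == null || bytes.Length == 0)
            throw new FormatException("Input is empty.");

        int index = 0;
        bool negative = bytes[0] == (byte)'-';
        if (negative || bytes[0] == (byte)'+')
            index++;
        if (index == bytes.Length)
            throw new FormatException("Input contains no digits.");

        long value = 0;
        for (; index < bytes.Length; index++)
        {
            int digit = bytes[index] - '0';
            if (digit < 0 || digit > 9)
                throw new FormatException("Input contains a non-digit byte: " + bytes[index]);
            // Accumulate as a negative number so that long.MinValue is representable;
            // 'checked' turns silent wrap-around into an OverflowException.
            value = checked(value * 10 - digit);
        }
        return negative ? value : checked(-value);
    }
}
```

The double case (with exponents) is considerably more involved and is not covered by this sketch.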
I'm not sure there is an easy way to do that.
Please note that this won't work with other encodings. A test on my computer showed that it is only about 3 times faster (I don't think it's worth it).
The code + test:
class MainClass
{
public static void Main(string[] args)
{
string str = "12341234";
byte[] buffer = Encoding.ASCII.GetBytes(str);
Stopwatch sw = Stopwatch.StartNew();
for(int i = 0; i < 1000000 ;i ++)
{
long val = BufferToLong.GetValue(buffer);
}
Console.WriteLine (sw.ElapsedMilliseconds);
sw.Restart();
for (int i = 0 ; i < 1000000 ; i++)
{
string valStr = Encoding.ASCII.GetString(buffer);
long val = long.Parse(valStr);
}
Console.WriteLine (sw.ElapsedMilliseconds);
}
}
static class BufferToLong
{
public static long GetValue(Byte[] buffer) {
long number = 0;
foreach (byte currentByte in buffer) {
char currentChar = (char)currentByte;
int currentDigit = currentChar - '0';
number *= 10 ;
number += currentDigit;
}
return number;
}
}
In the end, I created a C# version of the strtol function. This function comes with the CRT, and the source code of the CRT comes with Visual Studio.
The resulting method is almost the same as the code provided by @Jon Skeet in his answer, but it also contains some checks for overflow.
In my case all the changes proved to be very useful in terms of speed and memory.

C# Converting a XOR crypt function

I've been working on converting a C++ crypting method to C#. The problem is, I can't get it to encrypt/decrypt the way I want it to.
The idea is simple, I capture a packet, and decrypt it. The output will be:
Packet Size - Command/Action - Null (End)
(The decryptor cuts off the first and last 2 bytes)
The C++ code is this:
// Crypt the packet with Xor operator
void cryptPacket(char *packet)
{
unsigned short paksize=(*((unsigned short*)&packet[0])) - 2;
for(int i=2; i<paksize; i++)
{
packet[i] = 0x61 ^ packet[i];
}
}
So I thought this would work in C# if I didn't want to use pointers:
public static char[] CryptPacket(char[] packet)
{
ushort paksize = (ushort) (packet.Length - 2);
for(int i=2; i<paksize; i++)
{
packet[i] = (char) (0x61 ^ packet[i]);
}
return packet;
}
...but it doesn't: the value returned is just another line of rubbish instead of the decrypted value. The output given is: ..O♦&/OOOe.
Well, at least the '/' is in the right place for some reason.
Some more information:
The test packet I'm using is this:
Hex value: 0C 00 E2 66 65 47 4E 09 04 13 65 00
Plain text: ...feGN...e.
Decrypted: XX/hereXX
X = Unknown value. I can't really remember, but it doesn't matter.
Using Hex Workshop you can decrypt the packet this way:
Special Paste the hex value as CF_TEXT, make sure the 'treat as hexidecimal value' box is checked.
Afterwards, select everything from the hexadecimal value you just pasted, except the first and last 2 bytes.
Go to Tools>Operations>Xor.
Select 'Treat data as 8 bit data' and set value to '61'.
Press 'OK', and you're done.
That's all the information I can give at the moment, because I'm writing this off the top of my head.
Thank you for your time.
In case you don't see a question in this:
It would be great if someone could take a look at the code to see what's wrong with it, or if there's another way to do it. I'm converting this code because I'm horrible with C++, and want to create a C# application with that code.
Ps: The code tags and such were a pain, so I'm sorry if the spacing etc. is a little messed up.
Your problem might be that .NET's char is a 16-bit Unicode (UTF-16) code unit rather than a single byte, while your bitmask is only one byte long. So the most significant byte of each char will be left unaltered.
I just tried your function and it seems ok:
class Program
{
// OP's method: http://stackoverflow.com/questions/4815959
public static byte[] CryptPacket(byte[] packet)
{
int paksize = packet.Length - 2;
for (int i = 2; i < paksize; i++)
{
packet[i] = (byte)(0x61 ^ packet[i]);
}
return packet;
}
// http://stackoverflow.com/questions/321370 :)
public static byte[] StringToByteArray(string hex)
{
return Enumerable.Range(0, hex.Length).
Where(x => 0 == x % 2).
Select(x => Convert.ToByte(hex.Substring(x, 2), 16)).
ToArray();
}
static void Main(string[] args)
{
string hex = "0C 00 E2 66 65 47 4E 09 04 13 65 00".Replace(" ", "");
byte[] input = StringToByteArray(hex);
Console.WriteLine("Input: " + ASCIIEncoding.ASCII.GetString(input));
byte[] output = CryptPacket(input);
Console.WriteLine("Output: " + ASCIIEncoding.ASCII.GetString(output));
Console.ReadLine();
}
}
Console output:
Input: ...feGN.....
Output: ...../here..
(where '.' represents funny ascii characters)
It seems a bit smelly that your CryptPacket method is overwriting the initial array with the output values. And that irrelevant characters are not trimmed. But if you are trying to port something, I guess you should know what you are doing.
You could also consider trimming the input array first, to remove the unwanted characters, and then applying a generic XOR method to what is left, instead of keeping your own "specialized" version with the 2-byte offsets inside the crypt function itself. A generic version would look something like:
public static byte[] CryptPacket(byte[] packet)
{
// create a new instance
byte[] output = new byte[packet.Length];
// process ALL array items
for (int i = 0; i < packet.Length; i++)
{
output[i] = (byte)(0x61 ^ packet[i]);
}
return output;
}
Here's an almost literal translation from C++ to C#, and it seems to work:
var packet = new byte[] {
0x0C, 0x00, 0xE2, 0x66, 0x65, 0x47,
0x4E, 0x09, 0x04, 0x13, 0x65, 0x00
};
CryptPacket(packet);
// displays "....../here." where "." represents an unprintable character
Console.WriteLine(Encoding.ASCII.GetString(packet));
// ...
void CryptPacket(byte[] packet)
{
int paksize = (packet[0] | (packet[1] << 8)) - 2;
for (int i = 2; i < paksize; i++)
{
packet[i] ^= 0x61;
}
}
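If overwriting the caller's array is a concern, a non-mutating variant is possible. The following is only a sketch: PacketCrypt and CryptCopy are invented names, the size header is assumed to be little-endian as in the translation above, and the loop bounds deliberately mirror the C++ original.

```csharp
using System;

static class PacketCrypt
{
    // Hypothetical helper: returns an XORed copy and leaves the input untouched.
    public static byte[] CryptCopy(byte[] packet)
    {
        if (packet == null || packet.Length < 2)
            throw new ArgumentException("Packet is too short to contain a size header.", nameof(packet));

        // First two bytes hold the packet size (assumed little-endian).
        int paksize = (packet[0] | (packet[1] << 8)) - 2;
        if (paksize > packet.Length)
            throw new ArgumentException("Size header is larger than the buffer.", nameof(packet));

        var result = (byte[])packet.Clone();
        for (int i = 2; i < paksize; i++)
        {
            result[i] = (byte)(packet[i] ^ 0x61); // same bounds as the C++ loop
        }
        return result;
    }
}
```

Because XOR with a constant is its own inverse, calling CryptCopy on the result gives back the original bytes.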

Best way to shorten UTF8 string based on byte length

A recent project called for importing data into an Oracle database. The program that will do this is a C# .Net 3.5 app and I'm using the Oracle.DataAccess connection library to handle the actual inserting.
I ran into a problem where I'd receive this error message when inserting a particular field:
ORA-12899 Value too large for column X
I used Field.Substring(0, MaxLength); but still got the error (though not for every record).
Finally I saw what should have been obvious, my string was in ANSI and the field was UTF8. Its length is defined in bytes, not characters.
This gets me to my question. What is the best way to trim my string to fit the MaxLength?
My substring code works by character length. Is there a simple C# function that can trim a UTF8 string intelligently by byte length (i.e. not hack off half a character)?
I think we can do better than naively counting the total length of a string with each addition. LINQ is cool, but it can accidentally encourage inefficient code. What if I wanted the first 80,000 bytes of a giant UTF string? That's a lot of unnecessary counting. "I've got 1 byte. Now I've got 2. Now I've got 13... Now I have 52,384..."
That's silly. Most of the time, at least in l'anglais, we can cut exactly on that nth byte. Even in another language, we're less than 6 bytes away from a good cutting point.
So I'm going to start from @Oren's suggestion, which is to key off of the leading bit of a UTF8 char value. Let's start by cutting right at the n+1th byte, and use Oren's trick to figure out if we need to cut a few bytes earlier.
Three possibilities
If the first byte after the cut has a 0 in the leading bit, I know I'm cutting precisely before a single byte (conventional ASCII) character, and can cut cleanly.
If I have a 11 following the cut, the next byte after the cut is the start of a multi-byte character, so that's a good place to cut too!
If I have a 10, however, I know I'm in the middle of a multi-byte character, and need to go back to check to see where it really starts.
That is, though I want to cut the string after the nth byte, if that n+1th byte comes in the middle of a multi-byte character, cutting would create an invalid UTF8 value. I need to back up until I get to one that starts with 11 and cut just before it.
Code
Notes: I'm using stuff like Convert.ToByte("11000000", 2) so that it's easy to tell what bits I'm masking (a little more about bit masking here). In a nutshell, I'm &ing to return what's in the byte's first two bits and bringing back 0s for the rest. Then I check the XX from XX000000 to see if it's 10 or 11, where appropriate.
C# 7.0 and later support binary literals (for example 0b11000000), which is cool, but we'll keep using this kludge for now to illustrate what's going on.
The PadLeft is just because I'm overly OCD about output to the Console.
So here's a function that'll cut you down to a string that's n bytes long, or the greatest number less than n that ends with a "complete" UTF8 character.
public static string CutToUTF8Length(string str, int byteLength)
{
byte[] byteArray = Encoding.UTF8.GetBytes(str);
string returnValue = string.Empty;
if (byteArray.Length > byteLength)
{
int bytePointer = byteLength;
// Check high bit to see if we're [potentially] in the middle of a multi-byte char
if (bytePointer >= 0
&& (byteArray[bytePointer] & Convert.ToByte("10000000", 2)) > 0)
{
// If so, keep walking back until we have a byte starting with `11`,
// which means the first byte of a multi-byte UTF8 character.
while (bytePointer >= 0
&& Convert.ToByte("11000000", 2) != (byteArray[bytePointer] & Convert.ToByte("11000000", 2)))
{
bytePointer--;
}
}
// See if we had 1s in the high bit all the way back. If so, we're toast. Return empty string.
if (0 != bytePointer)
{
returnValue = Encoding.UTF8.GetString(byteArray, 0, bytePointer); // hat tip to @NealEhardt! Well played. ;^)
}
}
else
{
returnValue = str;
}
return returnValue;
}
I initially wrote this as a string extension. Just add back the this before string str to put it back into extension format, of course. I removed the this so that we could just slap the method into Program.cs in a simple console app to demonstrate.
Test and expected output
Here's a good test case, with the output it creates below, written as the Main method of a simple console app's Program.cs.
static void Main(string[] args)
{
string testValue = "12345“”67890”";
for (int i = 0; i < 15; i++)
{
string cutValue = Program.CutToUTF8Length(testValue, i);
Console.WriteLine(i.ToString().PadLeft(2) +
": " + Encoding.UTF8.GetByteCount(cutValue).ToString().PadLeft(2) +
":: " + cutValue);
}
Console.WriteLine();
Console.WriteLine();
foreach (byte b in Encoding.UTF8.GetBytes(testValue))
{
Console.WriteLine(b.ToString().PadLeft(3) + " " + (char)b);
}
Console.WriteLine("Return to end.");
Console.ReadLine();
}
Output follows. Notice that the "smart quotes" in testValue are three bytes long in UTF8 (though when we write the chars to the console in ASCII, it outputs dumb quotes). Also note the ?s output for the second and third bytes of each smart quote in the output.
The first five characters of our testValue are single bytes in UTF8, so 0-5 byte values should be 0-5 characters. Then we have a three-byte smart quote, which can't be included in its entirety until 5 + 3 bytes. Sure enough, we see that pop out at the call for 8. Our next smart quote pops out at 8 + 3 = 11, and then we're back to single byte characters through 14.
0: 0::
1: 1:: 1
2: 2:: 12
3: 3:: 123
4: 4:: 1234
5: 5:: 12345
6: 5:: 12345
7: 5:: 12345
8: 8:: 12345"
9: 8:: 12345"
10: 8:: 12345"
11: 11:: 12345""
12: 12:: 12345""6
13: 13:: 12345""67
14: 14:: 12345""678
49 1
50 2
51 3
52 4
53 5
226 â
128 ?
156 ?
226 â
128 ?
157 ?
54 6
55 7
56 8
57 9
48 0
226 â
128 ?
157 ?
Return to end.
So that's kind of fun, and I'm in just before the question's five year anniversary. Though Oren's description of the bits had a small error, that's exactly the trick you want to use. Thanks for the question; neat.
Here are two possible solutions: a LINQ one-liner processing the input left to right and a traditional for-loop processing the input from right to left. Which processing direction is faster depends on the string length, the allowed byte length, and the number and distribution of multibyte characters, so it is hard to give a general suggestion. The decision between LINQ and traditional code is probably a matter of taste (or maybe speed).
If speed matters, one could think about just accumulating the byte length of each character until reaching the maximum length instead of calculating the byte length of the whole string in each iteration. But I am not sure if this will work because I don't know UTF-8 encoding well enough. I could theoretically imagine that the byte length of a string does not equal the sum of the byte lengths of all characters.
public static String LimitByteLength(String input, Int32 maxLength)
{
return new String(input
.TakeWhile((c, i) =>
Encoding.UTF8.GetByteCount(input.Substring(0, i + 1)) <= maxLength)
.ToArray());
}
public static String LimitByteLength2(String input, Int32 maxLength)
{
for (Int32 i = input.Length - 1; i >= 0; i--)
{
if (Encoding.UTF8.GetByteCount(input.Substring(0, i + 1)) <= maxLength)
{
return input.Substring(0, i + 1);
}
}
return String.Empty;
}
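The accumulation idea mentioned above can be sketched with Rune, which is available from .NET Core 3.0 onwards. In UTF-8 the encoded length of a string is the sum of the encoded lengths of its Unicode scalar values (not of its UTF-16 chars, because of surrogate pairs), so summing per Rune works. Utf8Limiter and LimitByAccumulating are invented names for this illustration.

```csharp
using System;
using System.Text;

static class Utf8Limiter
{
    // Hypothetical helper: walks the string once, adding up UTF-8 lengths per scalar value.
    public static string LimitByAccumulating(string input, int maxBytes)
    {
        int byteCount = 0; // UTF-8 bytes accepted so far
        int charCount = 0; // UTF-16 chars accepted so far
        foreach (Rune rune in input.EnumerateRunes())
        {
            if (byteCount + rune.Utf8SequenceLength > maxBytes)
                break; // the next scalar value would not fit
            byteCount += rune.Utf8SequenceLength;
            charCount += rune.Utf16SequenceLength; // 2 for surrogate pairs
        }
        return input.Substring(0, charCount);
    }
}
```

This never splits a surrogate pair, but like most answers here it can still split a grapheme cluster made of several scalar values.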
Shorter version of ruffin's answer. Takes advantage of the design of UTF8:
public static string LimitUtf8ByteCount(this string s, int n)
{
// quick test (we probably won't be trimming most of the time)
if (Encoding.UTF8.GetByteCount(s) <= n)
return s;
// get the bytes
var a = Encoding.UTF8.GetBytes(s);
// if we are in the middle of a character (highest two bits are 10)
if (n > 0 && ( a[n]&0xC0 ) == 0x80)
{
// remove all bytes whose two highest bits are 10
// and one more (start of multi-byte sequence - highest bits should be 11)
while (--n > 0 && ( a[n]&0xC0 ) == 0x80)
;
}
// convert back to string (with the limit adjusted)
return Encoding.UTF8.GetString(a, 0, n);
}
All of the other answers appear to miss the fact that this functionality is already built into .NET, in the Encoder class. For bonus points, this approach will also work for other encodings.
public static string LimitByteLength(string message, int maxLength)
{
if (string.IsNullOrEmpty(message) || Encoding.UTF8.GetByteCount(message) <= maxLength)
{
return message;
}
var encoder = Encoding.UTF8.GetEncoder();
byte[] buffer = new byte[maxLength];
char[] messageChars = message.ToCharArray();
encoder.Convert(
chars: messageChars,
charIndex: 0,
charCount: messageChars.Length,
bytes: buffer,
byteIndex: 0,
byteCount: buffer.Length,
flush: false,
charsUsed: out int charsUsed,
bytesUsed: out int bytesUsed,
completed: out bool completed);
// I don't think we can return message.Substring(0, charsUsed)
// as that's the number of UTF-16 chars, not the number of codepoints
// (think about surrogate pairs). Therefore I think we need to
// actually convert bytes back into a new string
return Encoding.UTF8.GetString(buffer, 0, bytesUsed);
}
If you're using .NET Standard 2.1+, you can simplify it a bit:
public static string LimitByteLength(string message, int maxLength)
{
if (string.IsNullOrEmpty(message) || Encoding.UTF8.GetByteCount(message) <= maxLength)
{
return message;
}
var encoder = Encoding.UTF8.GetEncoder();
byte[] buffer = new byte[maxLength];
encoder.Convert(message.AsSpan(), buffer.AsSpan(), false, out _, out int bytesUsed, out _);
return Encoding.UTF8.GetString(buffer, 0, bytesUsed);
}
None of the other answers account for extended grapheme clusters, such as 👩🏽‍🚒. This is composed of 4 Unicode scalars (👩, 🏽, a zero-width joiner, and 🚒), so you need knowledge of the Unicode standard to avoid splitting it in the middle and producing 👩 or 👩🏽.
In .NET 5 onwards, you can write this as:
public static string LimitByteLength(string message, int maxLength)
{
if (string.IsNullOrEmpty(message) || Encoding.UTF8.GetByteCount(message) <= maxLength)
{
return message;
}
var enumerator = StringInfo.GetTextElementEnumerator(message);
var result = new StringBuilder();
int lengthBytes = 0;
while (enumerator.MoveNext())
{
lengthBytes += Encoding.UTF8.GetByteCount(enumerator.GetTextElement());
if (lengthBytes <= maxLength)
{
result.Append(enumerator.GetTextElement());
}
}
return result.ToString();
}
(This same code runs on earlier versions of .NET, but due to a bug it won't produce the correct result before .NET 5).
If a UTF-8 byte has a zero-valued high order bit, it's a complete single-byte character. If its two high order bits are 11, it's the first byte of a multi-byte character; if they are 10, it's in the 'middle' of a character. The ability to detect the beginning of a character was an explicit design goal of UTF-8.
Check out the Description section of the wikipedia article for more detail.
Is there a reason that you need the database column to be declared in terms of bytes? That's the default, but it's not a particularly useful default if the database character set is variable width. I'd strongly prefer declaring the column in terms of characters.
CREATE TABLE length_example (
col1 VARCHAR2( 10 BYTE ),
col2 VARCHAR2( 10 CHAR )
);
This will create a table where COL1 will store 10 bytes of data and col2 will store 10 characters worth of data. Character length semantics make far more sense in a UTF8 database.
Assuming you want all the tables you create to use character length semantics by default, you can set the initialization parameter NLS_LENGTH_SEMANTICS to CHAR. At that point, any tables you create will default to using character length semantics rather than byte length semantics if you don't specify CHAR or BYTE in the field length.
Following Oren Trutner's comment here are two more solutions to the problem:
Here we count the number of bytes to remove from the end of the string according to each character at the end of the string, so we don't evaluate the entire string in every iteration.
string str = "朣楢琴执执 瑩浻牡楧硰执执獧浻牡楧敬瑦 瀰 絸朣杢执獧扻捡杫潲湵 潣";
int maxBytesLength = 30;
var bytesArr = Encoding.UTF8.GetBytes(str);
int bytesToRemove = 0;
int lastIndexInString = str.Length -1;
while(bytesArr.Length - bytesToRemove > maxBytesLength)
{
bytesToRemove += Encoding.UTF8.GetByteCount(new char[] {str[lastIndexInString]} );
--lastIndexInString;
}
string trimmedString = Encoding.UTF8.GetString(bytesArr,0,bytesArr.Length - bytesToRemove);
// Encoding.UTF8.GetByteCount(trimmedString); // the actual length, which will be <= maxBytesLength
And an even more efficient(and maintainable) solution:
Get the string from the byte array according to the desired length and cut the last character, because it might be corrupted.
string str = "朣楢琴执执 瑩浻牡楧硰执执獧浻牡楧敬瑦 瀰 絸朣杢执獧扻捡杫潲湵 潣";
int maxBytesLength = 30;
string trimmedWithDirtyLastChar = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(str),0,maxBytesLength);
string trimmedString = trimmedWithDirtyLastChar.Substring(0,trimmedWithDirtyLastChar.Length - 1);
The only downside with the second solution is that we might cut a perfectly fine last character, but we are already cutting the string, so it might fit with the requirements.
Thanks to Shhade who thought about the second solution
This is another solution based on binary search:
public string LimitToUTF8ByteLength(string text, int size)
{
if (size <= 0)
{
return string.Empty;
}
int maxLength = text.Length;
int minLength = 0;
int length = maxLength;
while (maxLength >= minLength)
{
length = (maxLength + minLength) / 2;
int byteLength = Encoding.UTF8.GetByteCount(text.Substring(0, length));
if (byteLength > size)
{
maxLength = length - 1;
}
else if (byteLength < size)
{
minLength = length + 1;
}
else
{
return text.Substring(0, length);
}
}
// Round down the result
string result = text.Substring(0, length);
if (size >= Encoding.UTF8.GetByteCount(result))
{
return result;
}
else
{
return text.Substring(0, length - 1);
}
}
public static string LimitByteLength3(string input, Int32 maxLenth)
{
string result = input;
int byteCount = Encoding.UTF8.GetByteCount(input);
if (byteCount > maxLenth)
{
var byteArray = Encoding.UTF8.GetBytes(input);
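// Note: this cuts at a raw byte offset, so a multi-byte character split by the cut decodes to U+FFFD
result = Encoding.UTF8.GetString(byteArray, 0, maxLenth);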
}
return result;
}
