Reading a text file with a repeating pattern in C#

I want to read a hex number from a text file, use its last digit as the length of the string that follows, then read another number, and so on until the line ends.
using (StreamReader sr = new StreamReader(fileName))
{
    String line = sr.ReadLine();
    string hexText = line.Substring(0, 9);
    char c = hexText[8];
    int con = c - '0'; //saving the value
    string myHex = con.ToString("X");
    int length = Convert.ToInt32(myHex, 16);
    string fieldChars = line.Substring(0, length); //getting the key
    string b = line.Substring(c, length);
}
so for "5A3F00004olga" the length is correct and 4 (the last hex bit) but for some reason b is not Olga.Why?

Let's take a closer look at your code. You have:
char c = hexText[8];
int con = c - '0'; //saving the value
string myHex = con.ToString("X");
int length = Convert.ToInt32(myHex, 16);
string fieldChars = line.Substring(0, length); //getting the key
string b = line.Substring(c, length);
So c contains the value of the character at position 8. I don't have any idea why you're subtracting '0' from it, and then converting the result back to a string. You could just as easily write:
string myHex = c.ToString();
Also, if the value at hexText[8] were 'A', then subtracting '0' would give you 17 rather than the 10 that you expected.
I also don't know what you expect the line that assigns fieldChars to do, but I can pretty much guarantee that it's not doing what you want.
The reason b doesn't contain "olga" is that the substring starting position is the char c, and the char '4' is implicitly converted to its character code 52, not the number 4. The length could also be wrong, as described above.
I think the code you want is:
char c = hexText[8];
string myHex = c.ToString();
int length = Convert.ToInt32(myHex, 16);
string b = line.Substring(9, length);
I didn't include fieldChars there because it's unclear to me what value you want that variable to hold. If you want the key, you'd write line.Substring(0, 8). If you want the entire field, including the key, the length, and the text, you'd write line.Substring(0, 9 + length).
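Putting the pieces together, here's a minimal sketch of the repeating parse the question describes, using the line read above (this assumes every record is exactly an 8-character key, one hex length digit, then that many characters of text):
// Sketch only: assumes well-formed records of
// [8-char key][1 hex digit = text length][text], repeated until the line ends.
int pos = 0;
while (pos + 9 <= line.Length)
{
    string key = line.Substring(pos, 8);
    int length = Convert.ToInt32(line[pos + 8].ToString(), 16);
    string text = line.Substring(pos + 9, length);
    Console.WriteLine(key + " -> " + text);
    pos += 9 + length;
}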
As suggested in comments, you can use the C# debugger to walk through your code one line at a time, and inspect variables to see what is happening. That's an excellent way to see the results of the code you write. Better yet is to work a few examples by hand before you start writing code. That way you get a better understanding of what the code is supposed to do. It makes writing correct code a whole lot easier.
If you don't know how to use the debugger, now is the perfect time to learn.
Another way you can see what's happening is to put Console.WriteLine statements after each line. For example:
char c = hexText[8];
Console.WriteLine("c = " + c);
If you do that after each line, or after each important group of lines, displaying the values of important variables, you can see exactly where your program is going off into the weeds.

Related

I am trying to decrypt the autokey cipher in C#, but I keep getting an unhandled exception

I know how to fix the error, but I don't know where to place the line of code to make it successfully run. The place that I am having issues with is where I try and create the passletter character.
static string AutokeyDecrypt(string encryptmess, string pass)
{
    //sets both secret message and password to arrays so they can be shifted
    char[] passkey = pass.ToCharArray();
    char[] decryptmess = encryptmess.ToCharArray();
    char newletter = ' ';
    for (int i = 0; i < decryptmess.Length; i++)
    {
        char passletter = (char)(passkey[i] + newletter); //This is the line on which I am having issues. I need to concatenate the key (passletter) with each newletter that I decode.
        char messletter = decryptmess[i];
        //shifts the letters in the message back to original using the first letter of the concatenated key
        int shift = passletter - ' '; // passletter - (space character)
        newletter = (char)(messletter - shift); // Add shift to message letter
        //loops through the ASCII table
        if (newletter > '~')
        {
            newletter = (char)(newletter - 94);
        }
        else if (newletter < ' ')
        {
            newletter = (char)(newletter + 94);
        }
        decryptmess[i] = newletter;
    }
    return new string(decryptmess);
}
The problem is almost certainly that you are trying to access an index in the passkey array that is beyond the end of the array, which results in an IndexOutOfRangeException.
Specifically:
char passletter = (char)(passkey[i] + newletter);
Let's say that you have a password that is 10 characters in length. The valid indices for the passkey array are 0 to 9. At the 11th character of the message (i == 10) your code attempts to read from index 10, which is invalid.
The standard way to handle this is to use the modulus operator % to wrap the indexing through the valid values:
char passletter = (char)(passkey[i % passkey.Length] + newletter);
At i = 10 (for a 10 character array) this will return the first character (index 0) from the array.
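A quick illustration of the wrap-around, using a hypothetical 10-character key:
// i:        0 1 2 ... 9 10 11 12
// i % 10:   0 1 2 ... 9  0  1  2
for (int i = 0; i < 13; i++)
    Console.Write(i % 10 + " "); // prints 0 1 2 3 4 5 6 7 8 9 0 1 2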
Beyond the indexing error, there are some other issues that you should address.
At the top of your code you do this:
char[] passkey = new char[pass.Length];
passkey = pass.ToCharArray();
This creates an empty array, then immediately replaces it with a new array that is created from the pass string. The initial array is wasted space that will be disposed of by the garbage collector and isn't necessary here.
Long story short, replace the above code (and the same pattern immediately after it for decryptmess) with:
char[] passkey = pass.ToCharArray();
This is a very common pattern for people coming to C# from C or C++, where you often have to allocate an array and then fill it: most C code assumes you will hand it an array to fill rather than expecting it to allocate and return one. There are reasons to prefer either style, mostly around resource management, but they don't tend to apply in C#.
In C# it is more normal for arrays to be created by methods like ToArray(), ToCharArray(), and so on. There can be performance implications, but that's the idiom: when you call a method that returns an array or other object, you don't need to pre-allocate it yourself.
The other answer didn't factor in the AutoKey cypher, so here's a better one.
Going by the usual description and worked examples of the cypher, I think it's clear that your code isn't going to implement the decryption correctly.
The encryption algorithm uses a key stream that is composed of the message text appended to the password, and the decryption process rebuilds the key stream as it goes.
I've chosen to implement based on a flexible alphabet that looks like this:
char[] alphabet = " !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~".ToCharArray();
This includes all the printable ASCII characters from 32 (space) to 126 (~) in numeric order, but it could be any order and/or useful subset of the range. The alphabet is the first defining characteristic of your specific variant of the cypher. From your code it appears that the alphabet above matches what you're expecting.
Each encrypted character is calculated from the characters in the key stream and message at the location by looking up their indices in the array, adding those indices and locating the encrypted character in the alphabet:
string Encrypt(string plain, string pass)
{
    char[] message = plain.ToCharArray();
    // Key stream for encryption: the password followed by the plaintext itself.
    char[] keystream = (pass + plain).ToCharArray();
    for (int i = 0; i < message.Length; i++)
    {
        // Look up both characters in the alphabet and add their indices.
        int keyidx = Array.IndexOf(alphabet, keystream[i]);
        int msgidx = Array.IndexOf(alphabet, message[i]);
        message[i] = alphabet[(alphabet.Length + keyidx + msgidx) % alphabet.Length];
    }
    return new string(message);
}
NB: the line that sets message[i] uses a modulo operation to constrain the result to the array bounds. Since keyidx and msgidx can each be negative (Array.IndexOf returns -1 on a failed search) and C#'s % operator doesn't wrap negative values (-1 % 10 == -1, etc.), I've added alphabet.Length to ensure the index stays positive.
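You can see the behavior the note works around with a couple of quick checks:
Console.WriteLine(-1 % 10);        // -1: C#'s % keeps the sign of the dividend
Console.WriteLine((10 + -1) % 10); // 9: adding the modulus first keeps the index in range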
Decryption is basically the same except that the output of the decryption loop is used to build the content of the key stream as it goes. The key stream is initialized in the same way - concatenate password and message - but is updated as the decryption progresses.
string Decrypt(string encrypted, string pass)
{
    char[] message = encrypted.ToCharArray();
    // Start the key stream as password + cyphertext; the tail is overwritten
    // with decrypted characters as we go.
    char[] keystream = (pass + encrypted).ToCharArray();
    for (int i = 0; i < message.Length; i++)
    {
        int keyidx = Array.IndexOf(alphabet, keystream[i]);
        int msgidx = Array.IndexOf(alphabet, message[i]);
        // Subtract instead of add, wrapping into the array bounds.
        message[i] = alphabet[(alphabet.Length + msgidx - keyidx) % alphabet.Length];
        // Rebuild the key stream with the character we just decrypted.
        keystream[i + pass.Length] = message[i];
    }
    return new string(message);
}
Constructing the keystream buffer in this way ensures that we always have sufficient space to update, even when we reach the end of the process and no longer need the last few characters.
Of course the code above assumes that your inputs are going to be well behaved and not include anything outside of the alphabet values. Any value in the password or the message that doesn't match one of the values in the alphabet will be treated as identical to the last character in the alphabet since Array.IndexOf returns -1 for any failed search. So if your input string includes any Unicode or high ASCII characters they will be rendered as ~ in the output.
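A quick round-trip check of the two methods (a sketch, assuming the alphabet field above is in scope):
string pass = "PASSWORD";
string plain = "Attack at dawn!";
string encrypted = Encrypt(plain, pass);
Console.WriteLine(encrypted);                         // cyphertext built from the key stream
Console.WriteLine(Decrypt(encrypted, pass) == plain); // True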

Adler-32 algorithm example - clarity would be great

I am having a hard time understanding what exactly is going on behind this algorithm. I have the following code, which I believe works for the Wikipedia example, but I am having problems matching up the hex values for other inputs: while I get the correct hex value for the wiki example, it seems that my int finalValue is not always correct.
string fText, fileName, output;
Int32 a = 1, b = 0;
const int MOD_ADLER = 65521;
const int ADLER_CONST2 = 65536;

private void btnCalculate_Click(object sender, EventArgs e)
{
    fileName = tbFilePath.Text;
    if (fileName != "" && File.Exists(fileName))
    {
        fText = File.ReadAllText(fileName);
        foreach (char i in fText)
        {
            a = (a + Convert.ToInt32(i)) % MOD_ADLER;
            b = (b + a) % MOD_ADLER;
        }
        int finalValue = (b * ADLER_CONST2 + a);
        output = finalValue.ToString("X");
        lbValue.Text = output.ToString();
    }
    else
    {
        MessageBox.Show("This is not a valid filepath, or is a blank file.\n" +
            "Please enter a valid file path.");
    }
}
I understand that this is not an efficient way to go about this; I am just trying to understand what is really going on under the hood, so that I can create a more efficient algorithm that varies from this.
From my understanding: in my code, a starts at its initial value of 1 and accumulates the integer (32-bit) value of each character. I take the result mod the large prime 65521, and keep moving through the text of my file adding up the values until all of the characters have been processed.
These two lines are probably what confuse you:
a = ( a + Convert.ToInt32(i)) % MOD_ADLER;
b = (b + a) % MOD_ADLER;
Every char has an integer representation (its character code). The first line changes a to be the remainder of (the current value of a plus the integer representation of the char) divided by MOD_ADLER; that's what the % (remainder) operator computes.
For example, 5 % 2 == 1.
After that, the same thing happens for b: b becomes the remainder of (the current value of b plus a) divided by MOD_ADLER. After you do that once per character in the string, you have this:
int finalValue = (b * ADLER_CONST2 + a);
output = finalValue.ToString("X");
This line converts the final integer value to hex:
output = finalValue.ToString("X");
The "X" format specifier says to generate the hexadecimal representation of the number.
See MSDN Standard Numeric Format Strings
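To see the two lines at work, here is the Wikipedia example worked by hand; running the checksum over the string "Wikipedia" produces the documented value 11E60398:
int a = 1, b = 0;
const int MOD_ADLER = 65521;
foreach (char c in "Wikipedia")
{
    a = (a + c) % MOD_ADLER; // a: 88, 193, 300, 405, 517, 618, 718, 823, 920
    b = (b + a) % MOD_ADLER; // b: 88, 281, 581, 986, 1503, 2121, 2839, 3662, 4582
}
int finalValue = b * 65536 + a;              // 4582 * 65536 + 920
Console.WriteLine(finalValue.ToString("X")); // 11E60398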

Error "Hex string have a odd number of digits" while converting int->hex->binary in C#

Aim:
To convert an integer value first to a hex string and then to a byte[].
Example:
Need to convert int 1024 to hex string "400" to byte[]: 00000100 00000000
Method:
To convert from an integer to a hex string I tried the code below:
int i=1024;
string hexString = i.ToString("X");
I got the hex string value "400". Then I tried converting the hex string to a byte[] using the code below:
byte[] value = HexStringToByteArray(hexString);

/* function for converting hex string to byte array */
public byte[] HexStringToByteArray(string hex)
{
    int NumberChars = hex.Length;
    if (NumberChars % 2 == 1)
        throw new Exception("Hex string cannot have an odd number of digits.");
    byte[] bytes = new byte[NumberChars / 2];
    for (int i = 0; i < NumberChars; i += 2)
        bytes[i / 2] = Convert.ToByte(hex.Substring(i, 2), 16);
    return bytes;
}
Error:
Here I got the exception "Hex string cannot have an odd number of digits".
Solution: ??
You can force ToString to return a specific (even) number of digits:
string hexString = i.ToString("X8");
The exception is thrown by your own code. You can make your code more flexible to accept hex strings that have an odd number of digits:
if (hex.Length % 2 == 1) hex = "0"+hex;
Now you can remove the odd/even check, and your code will be alright.
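For example:
int i = 1024;
Console.WriteLine(i.ToString("X"));  // 400      (three digits - odd)
Console.WriteLine(i.ToString("X4")); // 0400     (padded to four digits - even)
Console.WriteLine(i.ToString("X8")); // 00000400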
Your code throws the exception you're seeing:
throw new Exception("Hex string cannot have an odd number of digits.");
You can improve the conversion method to also accept odd hex string lengths like this:
using System.Collections.Generic;
using System.Linq;
// ...
public byte[] HexStringToByteArray(string hex)
{
    var result = new List<byte>();
    for (int i = hex.Length - 1; i >= 0; i -= 2)
    {
        if (i > 0)
        {
            result.Insert(0, Convert.ToByte(hex.Substring(i - 1, 2), 16));
        }
        else
        {
            result.Insert(0, Convert.ToByte(hex.Substring(i, 1), 16));
        }
    }
    return result.ToArray();
}
This code iterates through the hex string from the end, adding new bytes to the beginning of the resulting list (which is turned into an array before the value is returned). If a single digit remains at the front, it is treated separately.
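With that change, the odd-length value from the question converts cleanly:
byte[] value = HexStringToByteArray("400");
Console.WriteLine(BitConverter.ToString(value)); // 04-00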
Your hex string has an odd number of digits and you are explicitly checking for that and throwing the exception. You need to decide why you put this line of code in there and whether you need to remove that in favour of other logic.
Other options are:
add a "0" to the beginning of the string to make it even length
force whoever is calling that code to always provide an even length string
change the later code to deal with odd numbers of characters properly...
In comments you have suggested that the first option is what you need, in which case:
if(hex.Length%2==1)
hex = "0"+hex;
Put this at the beginning of your method and if you get an odd number in then you will add the zero to it automatically. You can of course then take out your later check and exception throw.
Of note: you may want to validate the input string as hex, or just put a try/catch around the conversion, to make sure it is a valid hex string.
Also, since it isn't clear whether the string is a necessary intermediate step or just one that you think is necessary, you might be interested in "C# int to byte[]", which deals with converting to bytes without the intermediate string.
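If the hex string turns out not to be needed, here is a sketch of the direct route (the Array.Reverse call assumes a little-endian machine and produces the big-endian layout shown in the question):
int i = 1024;
byte[] bytes = BitConverter.GetBytes(i);         // machine byte order, 4 bytes for an int
if (BitConverter.IsLittleEndian)
    Array.Reverse(bytes);                        // big-endian: 00-00-04-00
Console.WriteLine(BitConverter.ToString(bytes)); // the question's 04 00 is the last two bytes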

Search ReadAllBytes for specific values

I am writing a program that reads '.exe' files and stores their hex values in an array of bytes for comparison with an array containing a series of values. (like a very simple virus scanner)
byte[] buffer = File.ReadAllBytes(currentDirectoryContents[j]);
I have then used BitConverter to create a single string of these values
string hex = BitConverter.ToString(buffer);
The next step is to search this string for a series of values (definitions) and return positive for a match. This is where I am running into problems. My definitions are hex values, created and saved in Notepad as definitions.xyz:
string[] definitions = File.ReadAllLines(@"C:\definitions.xyz");
I had been trying to read them into a string array and compare the definition elements of the array with string hex
bool[] test = new bool[currentDirectoryContents.Length];
test[j] = hex.Contains(definitions[i]);
This IS a section from a piece of homework, which is why I am not posting my entire code for the program. I had not used C# before last Friday so am most likely making silly mistakes at this point.
Any advice much appreciated :)
It is pretty unclear exactly what format you use for the definitions. Base64 is a good encoding for a byte[]; you can rapidly convert back and forth with Convert.ToBase64String() and Convert.FromBase64String(). But your question suggests the bytes are encoded in hex. Let's assume it looks like "01020304" for a new byte[] { 1, 2, 3, 4 }. Then this helper function converts such a string back to a byte[]:
static byte[] Hex2Bytes(string hex) {
    if (hex.Length % 2 != 0) throw new ArgumentException();
    var retval = new byte[hex.Length / 2];
    for (int ix = 0; ix < hex.Length; ix += 2) {
        retval[ix / 2] = byte.Parse(hex.Substring(ix, 2), System.Globalization.NumberStyles.HexNumber);
    }
    return retval;
}
You can now do a fast pattern search with an algorithm like Boyer-Moore.
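For illustration, a naive byte-array scan looks like this (Boyer-Moore speeds up the same idea by skipping ahead on mismatches):
static bool ContainsPattern(byte[] data, byte[] pattern)
{
    for (int i = 0; i <= data.Length - pattern.Length; i++)
    {
        int j = 0;
        while (j < pattern.Length && data[i + j] == pattern[j])
            j++;
        if (j == pattern.Length)
            return true; // full pattern found at offset i
    }
    return false;
}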
I expect you understand that this is a very inefficient way to do it. But except for that, you should just do something like this:
bool[] test = new bool[currentDirectoryContents.Length];
for (int i = 0; i < test.Length; i++) {
    byte[] buffer = File.ReadAllBytes(currentDirectoryContents[i]);
    string hex = BitConverter.ToString(buffer);
    test[i] = ContainsAny(hex, definitions);
}

bool ContainsAny(string s, string[] values) {
    foreach (string value in values) {
        if (s.Contains(value)) {
            return true;
        }
    }
    return false;
}
If you can use LINQ, you can do it like this:
var test = currentDirectoryContents.Select(
file=>definitions.Any(
definition =>
BitConverter.ToString(
File.ReadAllBytes(file)
).Contains(definition)
)
).ToArray();
Also, make sure that your definitions-file is formatted in a way that matches the output of BitConverter.ToString(): upper-case with dashes separating each encoded byte:
12-AB-F0-34
54-AC-FF-01-02

Best way to shorten UTF8 string based on byte length

A recent project called for importing data into an Oracle database. The program that will do this is a C# .Net 3.5 app and I'm using the Oracle.DataAccess connection library to handle the actual inserting.
I ran into a problem where I'd receive this error message when inserting a particular field:
ORA-12899 Value too large for column X
I used Field.Substring(0, MaxLength); but still got the error (though not for every record).
Finally I saw what should have been obvious, my string was in ANSI and the field was UTF8. Its length is defined in bytes, not characters.
This gets me to my question. What is the best way to trim my string to fit the MaxLength?
My substring code works by character length. Is there a simple C# function that can trim a UTF8 string intelligently by byte length (i.e. not hack off half a character)?
I think we can do better than naively counting the total length of a string with each addition. LINQ is cool, but it can accidentally encourage inefficient code. What if I wanted the first 80,000 bytes of a giant UTF string? That's a lot of unnecessary counting. "I've got 1 byte. Now I've got 2. Now I've got 13... Now I have 52,384..."
That's silly. Most of the time, at least in l'anglais, we can cut exactly on that nth byte. Even in another language, we're less than 6 bytes away from a good cutting point.
So I'm going to start from @Oren's suggestion, which is to key off of the leading bit of a UTF8 char value. Let's start by cutting right at the n+1th byte, and use Oren's trick to figure out if we need to cut a few bytes earlier.
Three possibilities
If the first byte after the cut has a 0 in the leading bit, I know I'm cutting precisely before a single byte (conventional ASCII) character, and can cut cleanly.
If I have a 11 following the cut, the next byte after the cut is the start of a multi-byte character, so that's a good place to cut too!
If I have a 10, however, I know I'm in the middle of a multi-byte character, and need to go back to check to see where it really starts.
That is, though I want to cut the string after the nth byte, if that n+1th byte comes in the middle of a multi-byte character, cutting would create an invalid UTF8 value. I need to back up until I get to one that starts with 11 and cut just before it.
Code
Notes: I'm using stuff like Convert.ToByte("11000000", 2) so that it's easy to tell which bits I'm masking. In a nutshell, I'm &ing to keep what's in the byte's first two bits and zeroing the rest. Then I check the XX from XX000000 to see if it's 10 or 11, where appropriate.
I found out today that C# might actually get support for binary literals (they eventually shipped in C# 7.0 as the 0b prefix), which is cool, but we'll keep using this kludge for now to illustrate what's going on.
The PadLeft is just because I'm overly OCD about output to the Console.
So here's a function that'll cut you down to a string that's n bytes long, or the greatest length less than n that ends with a "complete" UTF8 character.
public static string CutToUTF8Length(string str, int byteLength)
{
    byte[] byteArray = Encoding.UTF8.GetBytes(str);
    string returnValue = string.Empty;
    if (byteArray.Length > byteLength)
    {
        int bytePointer = byteLength;
        // Check high bit to see if we're [potentially] in the middle of a multi-byte char
        if (bytePointer >= 0
            && (byteArray[bytePointer] & Convert.ToByte("10000000", 2)) > 0)
        {
            // If so, keep walking back until we have a byte starting with `11`,
            // which means the first byte of a multi-byte UTF8 character.
            while (bytePointer >= 0
                && Convert.ToByte("11000000", 2) != (byteArray[bytePointer] & Convert.ToByte("11000000", 2)))
            {
                bytePointer--;
            }
        }
        // See if we had 1s in the high bit all the way back. If so, we're toast. Return empty string.
        if (0 != bytePointer)
        {
            returnValue = Encoding.UTF8.GetString(byteArray, 0, bytePointer); // hat tip to @NealEhardt! Well played. ;^)
        }
    }
    else
    {
        returnValue = str;
    }
    return returnValue;
}
I initially wrote this as a string extension. Just add back the this before string str to put it back into extension format, of course. I removed the this so that we could just slap the method into Program.cs in a simple console app to demonstrate.
Test and expected output
Here's a good test case, with the output it creates below, written as the Main method of a simple console app's Program.cs.
static void Main(string[] args)
{
    string testValue = "12345“”67890”";
    for (int i = 0; i < 15; i++)
    {
        string cutValue = Program.CutToUTF8Length(testValue, i);
        Console.WriteLine(i.ToString().PadLeft(2) +
            ": " + Encoding.UTF8.GetByteCount(cutValue).ToString().PadLeft(2) +
            ":: " + cutValue);
    }
    Console.WriteLine();
    Console.WriteLine();
    foreach (byte b in Encoding.UTF8.GetBytes(testValue))
    {
        Console.WriteLine(b.ToString().PadLeft(3) + " " + (char)b);
    }
    Console.WriteLine("Return to end.");
    Console.ReadLine();
}
Output follows. Notice that the "smart quotes" in testValue are three bytes long in UTF8 (though when we write the chars to the console in ASCII, it outputs dumb quotes). Also note the ?s output for the second and third bytes of each smart quote in the output.
The first five characters of our testValue are single bytes in UTF8, so 0-5 byte values should be 0-5 characters. Then we have a three-byte smart quote, which can't be included in its entirety until 5 + 3 bytes. Sure enough, we see that pop out at the call for 8. Our next smart quote pops out at 8 + 3 = 11, and then we're back to single byte characters through 14.
0: 0::
1: 1:: 1
2: 2:: 12
3: 3:: 123
4: 4:: 1234
5: 5:: 12345
6: 5:: 12345
7: 5:: 12345
8: 8:: 12345"
9: 8:: 12345"
10: 8:: 12345"
11: 11:: 12345""
12: 12:: 12345""6
13: 13:: 12345""67
14: 14:: 12345""678
49 1
50 2
51 3
52 4
53 5
226 â
128 ?
156 ?
226 â
128 ?
157 ?
54 6
55 7
56 8
57 9
48 0
226 â
128 ?
157 ?
Return to end.
So that's kind of fun, and I'm in just before the question's five year anniversary. Though Oren's description of the bits had a small error, that's exactly the trick you want to use. Thanks for the question; neat.
Here are two possible solutions - a LINQ one-liner processing the input left to right and a traditional for-loop processing the input from right to left. Which processing direction is faster depends on the string length, the allowed byte length, and the number and distribution of multi-byte characters, so it's hard to give a general suggestion. The decision between LINQ and traditional code is probably a matter of taste (or maybe speed).
If speed matters, one could think about accumulating the byte length of each character until reaching the maximum length, instead of calculating the byte length of the whole string in each iteration. But I am not sure this always works, because I don't know the UTF-8 encoding well enough. I could theoretically imagine that the byte length of a string does not equal the sum of the byte lengths of all its characters. (Indeed it can differ: a surrogate pair encodes as a single 4-byte sequence, but measuring its two chars separately gives a different total.)
public static String LimitByteLength(String input, Int32 maxLength)
{
    return new String(input
        .TakeWhile((c, i) =>
            Encoding.UTF8.GetByteCount(input.Substring(0, i + 1)) <= maxLength)
        .ToArray());
}

public static String LimitByteLength2(String input, Int32 maxLength)
{
    for (Int32 i = input.Length - 1; i >= 0; i--)
    {
        if (Encoding.UTF8.GetByteCount(input.Substring(0, i + 1)) <= maxLength)
        {
            return input.Substring(0, i + 1);
        }
    }
    return String.Empty;
}
Shorter version of ruffin's answer. Takes advantage of the design of UTF8:
public static string LimitUtf8ByteCount(this string s, int n)
{
    // quick test (we probably won't be trimming most of the time)
    if (Encoding.UTF8.GetByteCount(s) <= n)
        return s;
    // get the bytes
    var a = Encoding.UTF8.GetBytes(s);
    // if we are in the middle of a character (highest two bits are 10)
    if (n > 0 && (a[n] & 0xC0) == 0x80)
    {
        // remove all bytes whose two highest bits are 10
        // and one more (start of multi-byte sequence - highest bits should be 11)
        while (--n > 0 && (a[n] & 0xC0) == 0x80)
            ;
    }
    // convert back to string (with the limit adjusted)
    return Encoding.UTF8.GetString(a, 0, n);
}
All of the other answers appear to miss the fact that this functionality is already built into .NET, in the Encoder class. For bonus points, this approach will also work for other encodings.
public static string LimitByteLength(string message, int maxLength)
{
    if (string.IsNullOrEmpty(message) || Encoding.UTF8.GetByteCount(message) <= maxLength)
    {
        return message;
    }

    var encoder = Encoding.UTF8.GetEncoder();
    byte[] buffer = new byte[maxLength];
    char[] messageChars = message.ToCharArray();
    encoder.Convert(
        chars: messageChars,
        charIndex: 0,
        charCount: messageChars.Length,
        bytes: buffer,
        byteIndex: 0,
        byteCount: buffer.Length,
        flush: false,
        charsUsed: out int charsUsed,
        bytesUsed: out int bytesUsed,
        completed: out bool completed);

    // I don't think we can return message.Substring(0, charsUsed)
    // as that's the number of UTF-16 chars, not the number of codepoints
    // (think about surrogate pairs). Therefore I think we need to
    // actually convert bytes back into a new string
    return Encoding.UTF8.GetString(buffer, 0, bytesUsed);
}
If you're using .NET Standard 2.1+, you can simplify it a bit:
public static string LimitByteLength(string message, int maxLength)
{
    if (string.IsNullOrEmpty(message) || Encoding.UTF8.GetByteCount(message) <= maxLength)
    {
        return message;
    }

    var encoder = Encoding.UTF8.GetEncoder();
    byte[] buffer = new byte[maxLength];
    encoder.Convert(message.AsSpan(), buffer.AsSpan(), false, out _, out int bytesUsed, out _);
    return Encoding.UTF8.GetString(buffer, 0, bytesUsed);
}
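For example (a quick check of the method above; the Euro sign is three bytes in UTF-8):
Console.WriteLine(LimitByteLength("ab€cd", 5)); // ab€  (1 + 1 + 3 bytes fit exactly)
Console.WriteLine(LimitByteLength("ab€cd", 4)); // ab   (the € would be split, so it's dropped)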
None of the other answers account for extended grapheme clusters, such as 👩🏽‍🚒. This is composed of 4 Unicode scalars (👩, 🏽, a zero-width joiner, and 🚒), so you need knowledge of the Unicode standard to avoid splitting it in the middle and producing 👩 or 👩🏽.
In .NET 5 onwards, you can write this as:
// Requires using System.Globalization (StringInfo) and System.Text.
public static string LimitByteLength(string message, int maxLength)
{
    if (string.IsNullOrEmpty(message) || Encoding.UTF8.GetByteCount(message) <= maxLength)
    {
        return message;
    }

    var enumerator = StringInfo.GetTextElementEnumerator(message);
    var result = new StringBuilder();
    int lengthBytes = 0;
    while (enumerator.MoveNext())
    {
        lengthBytes += Encoding.UTF8.GetByteCount(enumerator.GetTextElement());
        if (lengthBytes <= maxLength)
        {
            result.Append(enumerator.GetTextElement());
        }
    }

    return result.ToString();
}
(This same code runs on earlier versions of .NET, but due to a bug it won't produce the correct result before .NET 5).
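As a quick check (on .NET 5+, per the note above): the firefighter cluster is 4 + 4 + 3 + 4 = 15 UTF-8 bytes, and it is kept or dropped as a unit:
Console.WriteLine(LimitByteLength("👩🏽‍🚒", 15)); // 👩🏽‍🚒 - the whole cluster fits
Console.WriteLine(LimitByteLength("👩🏽‍🚒", 14)); // ""   - dropped whole rather than split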
If a UTF-8 byte has a zero-valued high order bit, it's the beginning of a character. If its high order bit is 1, it's in the 'middle' of a character (strictly speaking, only bytes starting with 10 are continuation bytes; a byte starting with 11 begins a multi-byte character, which is the small error noted above). The ability to detect the beginning of a character was an explicit design goal of UTF-8.
Check out the Description section of the wikipedia article for more detail.
Is there a reason that you need the database column to be declared in terms of bytes? That's the default, but it's not a particularly useful default if the database character set is variable width. I'd strongly prefer declaring the column in terms of characters.
CREATE TABLE length_example (
col1 VARCHAR2( 10 BYTE ),
col2 VARCHAR2( 10 CHAR )
);
This will create a table where COL1 will store 10 bytes of data and col2 will store 10 characters worth of data. Character length semantics make far more sense in a UTF8 database.
Assuming you want all the tables you create to use character length semantics by default, you can set the initialization parameter NLS_LENGTH_SEMANTICS to CHAR. At that point, any tables you create will default to using character length semantics rather than byte length semantics if you don't specify CHAR or BYTE in the field length.
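For example, at the session level (this is standard Oracle SQL; setting the initialization parameter instead makes it the instance-wide default):
ALTER SESSION SET NLS_LENGTH_SEMANTICS = CHAR;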
Following Oren Trutner's comment here are two more solutions to the problem:
Here we count the number of bytes to remove from the end of the string, character by character from the end, so we don't evaluate the entire string in every iteration:
string str = "朣楢琴执执 瑩浻牡楧硰执执獧浻牡楧敬瑦 瀰 絸朣杢执獧扻捡杫潲湵 潣";
int maxBytesLength = 30;
var bytesArr = Encoding.UTF8.GetBytes(str);
int bytesToRemove = 0;
int lastIndexInString = str.Length - 1;
while (bytesArr.Length - bytesToRemove > maxBytesLength)
{
    bytesToRemove += Encoding.UTF8.GetByteCount(new char[] { str[lastIndexInString] });
    --lastIndexInString;
}
string trimmedString = Encoding.UTF8.GetString(bytesArr, 0, bytesArr.Length - bytesToRemove);
// Encoding.UTF8.GetByteCount(trimmedString); // the actual length, will be <= maxBytesLength
And an even more efficient (and maintainable) solution: get the string from the bytes array according to the desired length and cut the last character, because it might be corrupted:
string str = "朣楢琴执执 瑩浻牡楧硰执执獧浻牡楧敬瑦 瀰 絸朣杢执獧扻捡杫潲湵 潣";
int maxBytesLength = 30;
string trimmedWithDirtyLastChar = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(str), 0, maxBytesLength);
string trimmedString = trimmedWithDirtyLastChar.Substring(0, trimmedWithDirtyLastChar.Length - 1);
The only downside with the second solution is that we might cut a perfectly fine last character, but we are already cutting the string, so it might fit with the requirements.
Thanks to Shhade, who thought of the second solution.
This is another solution based on binary search:
public string LimitToUTF8ByteLength(string text, int size)
{
    if (size <= 0)
    {
        return string.Empty;
    }
    int maxLength = text.Length;
    int minLength = 0;
    int length = maxLength;
    while (maxLength >= minLength)
    {
        length = (maxLength + minLength) / 2;
        int byteLength = Encoding.UTF8.GetByteCount(text.Substring(0, length));
        if (byteLength > size)
        {
            maxLength = length - 1;
        }
        else if (byteLength < size)
        {
            minLength = length + 1;
        }
        else
        {
            return text.Substring(0, length);
        }
    }
    // Round down the result
    string result = text.Substring(0, length);
    if (size >= Encoding.UTF8.GetByteCount(result))
    {
        return result;
    }
    else
    {
        return text.Substring(0, length - 1);
    }
}
public static string LimitByteLength3(string input, Int32 maxLength)
{
    string result = input;
    int byteCount = Encoding.UTF8.GetByteCount(input);
    if (byteCount > maxLength)
    {
        var byteArray = Encoding.UTF8.GetBytes(input);
        // Note: this can cut in the middle of a multi-byte character,
        // in which case the decoded string ends with a replacement character.
        result = Encoding.UTF8.GetString(byteArray, 0, maxLength);
    }
    return result;
}
