I have a list of 92 booleans that I want to convert to a string. My idea was to take 8 booleans (bits), pack them into a byte (8 bits), use ASCII to convert the byte value to a char, and then append the chars to a string. However, after googling for more than 2 hours, no luck so far. I tried converting the List to a byte list, but that didn't work either ^^.
String strbyte = null;
for (int x = 0; x != tmpboolist.Count; x++) // tmpboolist is the ~90-element boolean list
{
    // this loop checks for true, then appends a '1' or a '0' to the string (strbyte)
    if (tmpboolist[x])
    {
        strbyte = strbyte + '1';
    }
    else
    {
        strbyte = strbyte + '0';
    }
}
// here I try to convert the string to a byte list, but no success:
// the testbytearray has the SAME size as tmpboolist (but it should be
// smaller, since 8 booleans should become 1 byte). However, all the
// 'bytes' are 48 and 49, which are the ASCII codes for '0' and '1'
// (see http://www.asciitable.com/)
Byte[] testbytearray = Encoding.Default.GetBytes(strbyte);
PS: Does anyone have a better suggestion on how to encode & decode a boolean list to a string?
(Because I want people to share their boolean list as a string rather than a list of 90 1s and 0s.)
EDIT: Got it working now! Thank you all for helping.
string text = new string(tmpboolist.Select(x => x ? '1' : '0').ToArray());
byte[] bytes = getBitwiseByteArray(text); //http://stackoverflow.com/a/6756231/1184013
String Arraycode = Convert.ToBase64String(bytes);
System.Windows.MessageBox.Show(Arraycode);
// first it makes a string out of the boolean list, then it uses the converter to make it a byte[] (array), then we use Base64 encoding to turn the byte[] into a string (that can be decoded later)
I'll look into the base-32 encoding later, thanks for all the help again :)
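For reference, a sketch of what a getBitwiseByteArray-style helper might look like (an assumption; the actual implementation is behind the link, and may differ in bit order):

// Hypothetical sketch: packs a "0"/"1" string into a byte[], 8 bits per byte,
// most significant bit first.
static byte[] GetBitwiseByteArray(string bits)
{
    var bytes = new byte[(bits.Length + 7) / 8];
    for (int i = 0; i < bits.Length; i++)
    {
        if (bits[i] == '1')
            bytes[i / 8] |= (byte)(1 << (7 - (i % 8)));
    }
    return bytes;
}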
You should store your boolean values in a BitArray.
var values = new BitArray(92);
values[0] = false;
values[1] = true;
values[2] = true;
...
Then you can convert the BitArray to a byte array
var bytes = new byte[(values.Length + 7) / 8];
values.CopyTo(bytes);
and the byte array to a Base64 string
var result = Convert.ToBase64String(bytes);
Conversely, you can convert a Base64 string back to a byte array
var bytes2 = Convert.FromBase64String(result);
and the byte array to a BitArray
var values2 = new BitArray(bytes2);
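Note that values2 will contain 96 bits (a whole number of bytes), not the original 92, so if the exact count matters you have to carry it separately. A minimal sketch, assuming the receiver knows the original count:

// Trim the decoded BitArray back to the known number of values (92 here).
var trimmed = new BitArray(92);
for (int i = 0; i < trimmed.Length; i++)
    trimmed[i] = values2[i];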
The Base64 string looks like this: "Liwd7bRv6TMY2cNE". This is probably a bit unhandy for sharing between people; have a look at human-oriented base-32 encoding:
Anticipated uses of these [base-32 strings] include cut-and-paste, text editing (e.g. in HTML files), manual transcription via a keyboard, manual transcription via pen-and-paper, vocal transcription over phone or radio, etc.
The desiderata for such an encoding are:
- minimizing transcription errors -- e.g. the well-known problem of confusing '0' with 'O'
- embedding into other structures -- e.g. search engines, structured or marked-up text, file systems, command shells
- brevity -- shorter [strings] are better than longer ones
- ergonomics -- human users (especially non-technical ones) should find the [strings] as easy and pleasant as possible; the uglier the [strings] look, the worse
To start with, it's a bad idea to concatenate strings in a loop like that - at least use StringBuilder, or use something like this with LINQ:
string text = new string(tmpboolist.Select(x => x ? '1' : '0').ToArray());
But converting your string to a List<bool> is easy with LINQ, using the fact that string implements IEnumerable<char>:
List<bool> values = text.Select(c => c == '1').ToList();
It's not clear where the byte array comes in... but you should not try to represent arbitrary binary data in a string just using Encoding.GetString. That's not what it's for.
If you don't care what format your string uses, then using Base64 will work well - but be aware that if you're grouping your Boolean values into bytes, you'll need extra information if you need to distinguish between "7 values" and "8 values, the first of which is False" for example.
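A minimal sketch of one way to carry that extra information, simply prefixing the packed bytes with the value count (the framing here is illustrative, not a standard format):

// Hypothetical framing: one count byte (works for < 256 values), then the packed bits.
var bits = new BitArray(values.ToArray()); // values is the List<bool>
var payload = new byte[(bits.Length + 7) / 8];
bits.CopyTo(payload, 0);
var framed = new byte[payload.Length + 1];
framed[0] = (byte)values.Count;            // the decoder reads this to know how many bits are real
payload.CopyTo(framed, 1);
string encoded = Convert.ToBase64String(framed);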
Since I infer from your code that you want a string with n digits of either 1 or 0, depending on the internal list's bool values, how about...
public override string ToString()
{
    StringBuilder output = new StringBuilder(91);
    foreach (bool item in this.tempboolist)
    {
        output.Append(item ? "1" : "0");
    }
    return output.ToString();
}
Warning: this was off-the-cuff typing; I have not validated it with a compiler yet!
This function does what you want:
public String convertBArrayToStr(bool[] input)
{
    if (input == null)
        return "";

    int length = input.Length;
    int byteArrayCount = (length - 1) / 8 + 1;
    var bytes = new char[byteArrayCount];

    for (int i = 0; i < length; i++)
    {
        var mappedIndex = i / 8; // note: (i - 1) / 8 would wrongly pack 9 bits into the first char
        bytes[mappedIndex] = (char)(2 * bytes[mappedIndex] + (input[i] ? 1 : 0));
    }
    return new string(bytes);
}
I'm having trouble reading data from a DB of a Siemens PLC S7-1500 using S7netplus.
The situation:
I have a C# application running.
I can connect to the PLC just fine.
I can read data such as Boolean, UInt, UShort, and Bytes.
But I don't know how to read String data.
To read the other data, like Booleans, I use this call:
plc.Read("DB105.DBX0.0")
I understand that this reads Data Block 105 (DB105), with datatype Boolean (DBX), at offset 0.0.
I would like to apply the same kind of read to the string, so I tried "DB105.DBB10.0" in my example. But it returns the value 40 as a byte (and I should get something else).
I saw that there is another read method:
plc.ReadBytes(DataType DB, int DBNumber, int StartByteArray, int lengthToRead)
But I have difficulty seeing how to apply it to my example (I know that I have to convert the result to a string afterwards).
To summarize:
- Is there a simple way, with an address like "DB105.DBX0.0", to read string data from a Siemens PLC?
- If not, how do I use the ReadBytes function in my example?
Thanks for your help
I managed to read my string value by the ReadBytes method.
In my example I needed to pass values like this:
plc.Read(DataType.DataBlock, 105, 12, VarType.String, 40);
Why 12? Because the first 2 bytes of an S7 string hold its length information, so reading at offset 10 returns the value 40, which is the declared length.
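For illustration, the same read can be done at the byte level with the ReadBytes signature from the question (a sketch; the offsets assume the string starts at DBB10, with the two header bytes holding the declared capacity and the current length):

byte[] header = plc.ReadBytes(DataType.DataBlock, 105, 10, 2);
int maxLen = header[0];  // declared capacity of the S7 string
int curLen = header[1];  // number of characters currently stored
byte[] raw = plc.ReadBytes(DataType.DataBlock, 105, 12, curLen);
string value = System.Text.Encoding.ASCII.GetString(raw);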
I have overridden the read method to accept the 'easy string' call like this:
public T Read<T>(object pValue)
{
    var splitValue = pValue.ToString().Split('.');

    // check if it is a string template (3 separators '.', 2 if not)
    if (splitValue.Count() > 3 && splitValue[1].Substring(2, 1) == "S")
    {
        DataType dType;
        // To read strings from other data types, further work is needed here.
        if (splitValue[0].Substring(0, 2) == "DB")
            dType = DataType.DataBlock;
        else
            throw new Exception("Data Type not supported for string value yet.");

        int length = Convert.ToInt32(splitValue[3]);
        int start = Convert.ToInt32(splitValue[1].Substring(3, splitValue[1].Length - 3));
        int memoryNumber = Convert.ToInt32(splitValue[0].Substring(2, splitValue[0].Length - 2));

        // the first 2 bytes hold the length of the string, so skip past them
        int startString = start + 2;
        var value = ReadFull(dType, memoryNumber, startString, VarType.String, length);
        return (T)value;
    }
    else
    {
        var value = plc.Read(pValue.ToString());
        // Cast to the requested type.
        return (T)value;
    }
}
So now I can call my read function like this:
With the basic existing call:
var element = mPlc.Read<bool>("DB10.DBX1.4").ToString(); // reads a boolean value in data block 10, at byte 1, bit 4
var element = mPlc.Read<uint>("DB10.DBD4.0").ToString(); // reads a uint value in data block 10, at byte 4, bit 0
With the overridden call for strings:
var element = mPlc.Read<string>("DB105.DBS10.0.40").ToString(); // reads a string value in data block 105, at byte 10, bit 0, with a length of 40
Hope this helps anyone else :)
I did it slightly simpler: I ignore the first byte, then read the second byte to get the string length, and use that as the number of bytes to read for the string. For example, the PLC gave me a DB offset of 288 for the start of the string. This uses the S7Plus NuGet package, with a DB address of 666.
Note: requesting strings seriously slows down the communication, so it's probably better to only request them when there is a new value.
TempStringLength(0) = PLC.Read(DataType.DataBlock, 666, 289, VarType.Byte, 1) ' Length of string
TempStringArray(0) = PLC.Read(DataType.DataBlock, 666, 290, VarType.String, TempStringLength(0)) ' Actual string
I am writing a program that reads '.exe' files and stores their hex values in an array of bytes for comparison with an array containing a series of values (like a very simple virus scanner).
byte[] buffer = File.ReadAllBytes(currentDirectoryContents[j]);
I have then used BitConverter to create a single string of these values
string hex = BitConverter.ToString(buffer);
The next step is to search this string for a series of values (definitions) and return positive for a match. This is where I am running into problems. My definitions are hex values, created and saved in Notepad as definitions.xyz.
string[] definitions = File.ReadAllLines(@"C:\definitions.xyz");
I have been trying to read them into a string array and compare the definition elements of the array with the hex string:
bool[] test = new bool[currentDirectoryContents.Length];
test[j] = hex.Contains(definitions[i]);
This IS a section from a piece of homework, which is why I am not posting my entire code for the program. I had not used C# before last Friday, so I am most likely making silly mistakes at this point.
Any advice much appreciated :)
It is pretty unclear exactly what format you use for the definitions. Base64 is a good encoding for a byte[]; you can rapidly convert back and forth with Convert.ToBase64String() and Convert.FromBase64String(). But your question suggests the bytes are encoded in hex. Let's assume it looks like "01020304" for a new byte[] { 1, 2, 3, 4 }. Then this helper function converts such a string back to a byte[]:
static byte[] Hex2Bytes(string hex) {
    if (hex.Length % 2 != 0) throw new ArgumentException();
    var retval = new byte[hex.Length / 2];
    for (int ix = 0; ix < hex.Length; ix += 2) {
        retval[ix / 2] = byte.Parse(hex.Substring(ix, 2), System.Globalization.NumberStyles.HexNumber);
    }
    return retval;
}
You can now do a fast pattern search with an algorithm like Boyer-Moore.
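If you would rather search the raw bytes than hex strings, a straightforward scan (not Boyer-Moore, just the naive version for illustration) could look like this:

// Returns true if needle occurs anywhere in haystack.
static bool ContainsPattern(byte[] haystack, byte[] needle)
{
    for (int i = 0; i <= haystack.Length - needle.Length; i++)
    {
        int j = 0;
        while (j < needle.Length && haystack[i + j] == needle[j])
            j++;
        if (j == needle.Length)
            return true; // full match starting at offset i
    }
    return false;
}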
I expect you understand that this is a very inefficient way to do it. But except for that, you should just do something like this:
bool[] test = new bool[currentDirectoryContents.Length];
for (int i = 0; i < test.Length; i++) {
    byte[] buffer = File.ReadAllBytes(currentDirectoryContents[i]); // index with i, not j
    string hex = BitConverter.ToString(buffer);
    test[i] = ContainsAny(hex, definitions);
}

bool ContainsAny(string s, string[] values) {
    foreach (string value in values) {
        if (s.Contains(value)) {
            return true;
        }
    }
    return false;
}
If you can use LINQ, you can do it like this:
var test = currentDirectoryContents.Select(
    file => definitions.Any(
        definition =>
            BitConverter.ToString(
                File.ReadAllBytes(file)
            ).Contains(definition)
    )
).ToArray();
Also, make sure that your definitions file is formatted in a way that matches the output of BitConverter.ToString(): upper-case, with dashes separating each encoded byte:
12-AB-F0-34
54-AC-FF-01-02
I'm converting a List<string> into a byte array like this:
Byte[] bArray = userList
    .SelectMany(s => System.Text.Encoding.ASCII.GetBytes(s))
    .ToArray();
How can I convert it back to a List<string>? I tried using ASCII.GetString(s) in the code above, but GetString expects a byte[], not a single byte.
It's not possible to reverse your algorithm.
The problem can be seen if you consider what happens when you have two users called "ab" and "c". This will give the exact same bytes as if you have two users called "a" and "bc". There is no way to distinguish between these two cases with your approach.
Instead of inventing your own serialization format, you could just use the serialization that is built into the .NET Framework, such as the BinaryFormatter.
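A minimal round-trip sketch of that built-in route (note that BinaryFormatter has since been marked obsolete in newer .NET versions):

var formatter = new System.Runtime.Serialization.Formatters.Binary.BinaryFormatter();
using (var stream = new System.IO.MemoryStream())
{
    formatter.Serialize(stream, userList);   // writes the List<string> with full structure
    byte[] bArray = stream.ToArray();

    stream.Position = 0;
    var restored = (List<string>)formatter.Deserialize(stream);
}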
As a bit of a side note, if you preserve a zero-byte string terminator you can easily concatenate the strings and extract all the information, e.g.:
Byte[] bArray = userList
    .SelectMany(s => System.Text.Encoding.ASCII.GetBytes(s + '\0')) // add 0 byte
    .ToArray();

List<string> names = new List<string>();
for (int i = 0; i < bArray.Length; i++)
{
    int end = i;
    while (bArray[end] != 0) // scan for zero byte
        end++;
    var length = end - i;
    var word = new byte[length];
    Array.Copy(bArray, i, word, 0, length);
    names.Add(ASCIIEncoding.ASCII.GetString(word));
    i += length;
}
You need to insert a delimiter between your strings so that you can split the big byte array back into the original users. The delimiter should be a character that cannot be part of a user name.
Example (assuming | cannot be part of a user name):
var bytes = System.Text.Encoding.ASCII.GetBytes(string.Join("|", userList.ToArray()));
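Decoding is then just the reverse, splitting on the delimiter:

var userList2 = System.Text.Encoding.ASCII.GetString(bytes).Split('|').ToList();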
You can't do this since the delimiters of the array structure were lost in the SelectMany method.
What I need is very simple, but before I reinvent the wheel, I would like to know if something similar exist in the framework already.
I would like to encode (and decode) strings using a predefined character table. I have many strings that contain only a few distinct characters. Here is a string I would like to encode:
cn=1;pl=23;vf=3;vv=0
This string is 20 characters long, so 20 bytes.
In the string, I only use the following characters: cn=1;p23vf0
That's a total of 11 characters, so each character can be encoded with only 4 bits, can't it? That would reduce the total number of bytes used to 10.
Is there any existing method in .NET that takes a string and the reference table array as parameters and returns the encoded bytes?
char[] reference = "cn=1;p23vf0".ToCharArray();
string input = "cn=1;pl=23;vf=3;vv=0";
byte[] encoded = someClass.Encode(input, reference);
string decoded = someClass.Decode(encoded, reference);
Assert.AreEqual(input, decoded);
Compression algorithms generally use Huffman encoding, which is basically what you are looking for here. That encoding isn't exposed as a separate class; it is part of the algorithm of the DeflateStream and GZipStream classes, which is what you ought to use, as long as your strings are a reasonable size. If they are short, there isn't any point in encoding them.
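A minimal round-trip sketch with GZipStream (keep in mind that for a 20-character string, the gzip header overhead will usually outweigh any savings):

// requires System.IO, System.IO.Compression and System.Text
static byte[] Compress(string input)
{
    using (var output = new MemoryStream())
    {
        using (var gzip = new GZipStream(output, CompressionMode.Compress))
        {
            byte[] data = Encoding.UTF8.GetBytes(input);
            gzip.Write(data, 0, data.Length);
        } // the GZipStream must be closed before reading the buffer
        return output.ToArray();
    }
}

static string Decompress(byte[] compressed)
{
    using (var input = new MemoryStream(compressed))
    using (var gzip = new GZipStream(input, CompressionMode.Decompress))
    using (var reader = new StreamReader(gzip, Encoding.UTF8))
        return reader.ReadToEnd();
}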
Interesting question... There isn't anything built into the framework, but it can be done, for example, like this:
public static byte[] Encode(string input, string reference) {
    int size = 1;
    while ((1 << ++size) < reference.Length); // smallest bit width that fits the table
    byte[] result = new byte[(size * input.Length + 7) / 8];
    new BitArray(
        input
            .Select(c => {
                int index = reference.IndexOf(c);
                return Enumerable.Range(0, size).Select(i => (index & (1 << i)) != 0);
            })
            .SelectMany(a => a)
            .ToArray()
    ).CopyTo(result, 0);
    return result;
}

public static string Decode(byte[] encoded, int length, string reference) {
    int size = 1;
    while ((1 << ++size) < reference.Length);
    return new String(
        new BitArray(encoded)
            .Cast<bool>()
            .Take(length * size)
            .Select((b, i) => new { Index = i / size, Bit = b })
            .GroupBy(g => g.Index)
            .Select(g => reference[g.Select((b, i) => (b.Bit ? 1 : 0) << i).Sum()])
            .ToArray()
    );
}
The code is a bit complicated, but that is because it works with any number of bits per character, not just four.
You encode the string like in your question, except that the reference string contains twelve different characters, not eleven:
string reference = "cn=1;pl23vf0";
string input = "cn=1;pl=23;vf=3;vv=0";
byte[] encoded = Encode(input, reference);
To decode the string you also need the length of the original string, as that is impossible to tell from the length of the encoded data:
string decoded = Decode(encoded, input.Length, reference);
(Alternatively to supplying the length you could of course introduce an EOF character, or a padding character similar to how base64 pads the data.)
There's no out-of-the-box class that does exactly this, but it's not too hard using the BitArray class of .NET.
Once you have a bit-array, you can convert it to a string, or a packed byte representation.
// modify this as appropriate to divide your original input string...
public IEnumerable<string> Divide(string s)
{
    for (int i = 0; i < s.Length; i += 2)
        yield return s.Substring(i, 2);
}

public IEnumerable<bool> AsBoolArray(byte b)
{
    var i = 4; // assume we only want 4 bits
    while (i-- > 0)
    {
        yield return (b & 0x01) != 0;
        b >>= 1;
    }
}

// define your own mapping table...
var mappingTable =
    new Dictionary<string, int>() { { "cn", 1 }, { "pl", 23 }, { "vf", 3 }, { "vv", 0 } /*...*/ };
var originalString = "cncnvfvvplvvplpl";

// encode the data by mapping each string to the dictionary...
var encodedData = Divide(originalString).Select(s => mappingTable[s]);

// then convert into a bit vector based on the boolean representation of each value...
// The AsBoolArray() method returns the 4-bit encoding of each value as bools
var packedBitVector =
    new BitArray(encodedData.SelectMany(x => AsBoolArray((byte)x)).ToArray());

// you can use BitArray.CopyTo() to get the representation out as a packed int[]
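// For example, a small sketch extracting the packed bits as bytes instead:
var packedBytes = new byte[(packedBitVector.Length + 7) / 8];
packedBitVector.CopyTo(packedBytes, 0); // or an int[] of length (Length + 31) / 32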
I think if you want to minimize the size of the string, it's better to use System.IO.Compression.GZipStream here. It's very simple to use and will likely compress your string by much more than a factor of 2.
There is nothing like that built into the Base Class Library. You will have to build your own.
Take a look at the Encoder class from System.Text - some elements may be of help.
Would the StringBuilder class be of any help?
You can use the CryptoAPI. Here is a good example, including the methods to encrypt and decrypt a string. I don't think it will "compress" your data for you, though.
A recent project called for importing data into an Oracle database. The program that will do this is a C# .NET 3.5 app, and I'm using the Oracle.DataAccess connection library to handle the actual inserting.
I ran into a problem where I'd receive this error message when inserting a particular field:
ORA-12899: value too large for column X
I used Field.Substring(0, MaxLength); but still got the error (though not for every record).
Finally I saw what should have been obvious, my string was in ANSI and the field was UTF8. Its length is defined in bytes, not characters.
This gets me to my question. What is the best way to trim my string to fit MaxLength?
My substring code works by character length. Is there a simple C# function that can trim a UTF-8 string intelligently by byte length (i.e. not hack off half a character)?
I think we can do better than naively counting the total length of a string with each addition. LINQ is cool, but it can accidentally encourage inefficient code. What if I wanted the first 80,000 bytes of a giant UTF-8 string? That's a lot of unnecessary counting. "I've got 1 byte. Now I've got 2. Now I've got 13... Now I have 52,384..."
That's silly. Most of the time, at least in l'anglais, we can cut exactly on that nth byte. Even in another language, we're less than 6 bytes away from a good cutting point.
So I'm going to start from @Oren's suggestion, which is to key off of the leading bit of a UTF-8 char value. Let's start by cutting right at the n+1th byte, and use Oren's trick to figure out if we need to cut a few bytes earlier.
Three possibilities
If the first byte after the cut has a 0 in the leading bit, I know I'm cutting precisely before a single-byte (conventional ASCII) character, and can cut cleanly.
If the byte following the cut starts with 11, it is the start of a multi-byte character, so that's a good place to cut too!
If it starts with 10, however, I know I'm in the middle of a multi-byte character and need to go back to check where it really starts.
That is, though I want to cut the string after the nth byte, if that n+1th byte comes in the middle of a multi-byte character, cutting would create an invalid UTF-8 value. I need to back up until I get to one that starts with 11 and cut just before it.
Code
Notes: I'm using stuff like Convert.ToByte("11000000", 2) so that it's easy to tell what bits I'm masking (a little more about bit masking here). In a nutshell, I'm &ing to return what's in the byte's first two bits and bringing back 0s for the rest. Then I check the XX from XX000000 to see if it's 10 or 11, where appropriate.
I found out today that C# 6.0 might actually support binary representations, which is cool, but we'll keep using this kludge for now to illustrate what's going on.
The PadLeft is just because I'm overly OCD about output to the Console.
So here's a function that'll cut you down to a string that's n bytes long, or the greatest length less than n that ends with a "complete" UTF-8 character.
public static string CutToUTF8Length(string str, int byteLength)
{
    byte[] byteArray = Encoding.UTF8.GetBytes(str);
    string returnValue = string.Empty;

    if (byteArray.Length > byteLength)
    {
        int bytePointer = byteLength;

        // Check high bit to see if we're [potentially] in the middle of a multi-byte char
        if (bytePointer >= 0
            && (byteArray[bytePointer] & Convert.ToByte("10000000", 2)) > 0)
        {
            // If so, keep walking back until we have a byte starting with `11`,
            // which means the first byte of a multi-byte UTF8 character.
            while (bytePointer >= 0
                && Convert.ToByte("11000000", 2) != (byteArray[bytePointer] & Convert.ToByte("11000000", 2)))
            {
                bytePointer--;
            }
        }

        // See if we had 1s in the high bit all the way back. If so, we're toast. Return empty string.
        if (0 != bytePointer)
        {
            returnValue = Encoding.UTF8.GetString(byteArray, 0, bytePointer); // hat tip to @NealEhardt! Well played. ;^)
        }
    }
    else
    {
        returnValue = str;
    }

    return returnValue;
}
I initially wrote this as a string extension. Just add back the this before string str to put it back into extension format, of course. I removed the this so that we could just slap the method into Program.cs in a simple console app to demonstrate.
Test and expected output
Here's a good test case, with the output it creates below, written as the Main method of a simple console app's Program.cs.
static void Main(string[] args)
{
    string testValue = "12345“”67890”";

    for (int i = 0; i < 15; i++)
    {
        string cutValue = Program.CutToUTF8Length(testValue, i);
        Console.WriteLine(i.ToString().PadLeft(2) +
            ": " + Encoding.UTF8.GetByteCount(cutValue).ToString().PadLeft(2) +
            ":: " + cutValue);
    }

    Console.WriteLine();
    Console.WriteLine();

    foreach (byte b in Encoding.UTF8.GetBytes(testValue))
    {
        Console.WriteLine(b.ToString().PadLeft(3) + " " + (char)b);
    }

    Console.WriteLine("Return to end.");
    Console.ReadLine();
}
Output follows. Notice that the "smart quotes" in testValue are three bytes long in UTF8 (though when we write the chars to the console in ASCII, it outputs dumb quotes). Also note the ?s output for the second and third bytes of each smart quote in the output.
The first five characters of our testValue are single bytes in UTF-8, so 0-5 byte values should be 0-5 characters. Then we have a three-byte smart quote, which can't be included in its entirety until 5 + 3 bytes. Sure enough, we see that pop out at the call for 8. Our next smart quote pops out at 8 + 3 = 11, and then we're back to single-byte characters through 14.
0: 0::
1: 1:: 1
2: 2:: 12
3: 3:: 123
4: 4:: 1234
5: 5:: 12345
6: 5:: 12345
7: 5:: 12345
8: 8:: 12345"
9: 8:: 12345"
10: 8:: 12345"
11: 11:: 12345""
12: 12:: 12345""6
13: 13:: 12345""67
14: 14:: 12345""678
49 1
50 2
51 3
52 4
53 5
226 â
128 ?
156 ?
226 â
128 ?
157 ?
54 6
55 7
56 8
57 9
48 0
226 â
128 ?
157 ?
Return to end.
So that's kind of fun, and I'm in just before the question's five year anniversary. Though Oren's description of the bits had a small error, that's exactly the trick you want to use. Thanks for the question; neat.
Here are two possible solutions: a LINQ one-liner processing the input left to right, and a traditional for-loop processing the input right to left. Which processing direction is faster depends on the string length, the allowed byte length, and the number and distribution of multi-byte characters, so it is hard to give a general suggestion. The decision between LINQ and traditional code is probably a matter of taste (or maybe speed).
If speed matters, one could think about just accumulating the byte length of each character until reaching the maximum length, instead of calculating the byte length of the whole string in each iteration. But I am not sure if this would work, because I don't know the UTF-8 encoding well enough. I could theoretically imagine that the byte length of a string does not equal the sum of the byte lengths of all its characters.
public static String LimitByteLength(String input, Int32 maxLength)
{
    return new String(input
        .TakeWhile((c, i) =>
            Encoding.UTF8.GetByteCount(input.Substring(0, i + 1)) <= maxLength)
        .ToArray());
}

public static String LimitByteLength2(String input, Int32 maxLength)
{
    for (Int32 i = input.Length - 1; i >= 0; i--)
    {
        if (Encoding.UTF8.GetByteCount(input.Substring(0, i + 1)) <= maxLength)
        {
            return input.Substring(0, i + 1);
        }
    }
    return String.Empty;
}
Shorter version of ruffin's answer. Takes advantage of the design of UTF8:
public static string LimitUtf8ByteCount(this string s, int n)
{
    // quick test (we probably won't be trimming most of the time)
    if (Encoding.UTF8.GetByteCount(s) <= n)
        return s;

    // get the bytes
    var a = Encoding.UTF8.GetBytes(s);

    // if we are in the middle of a character (highest two bits are 10)
    if (n > 0 && (a[n] & 0xC0) == 0x80)
    {
        // remove all bytes whose two highest bits are 10
        // and one more (start of multi-byte sequence - highest bits should be 11)
        while (--n > 0 && (a[n] & 0xC0) == 0x80)
            ;
    }

    // convert back to string (with the limit adjusted)
    return Encoding.UTF8.GetString(a, 0, n);
}
All of the other answers appear to miss the fact that this functionality is already built into .NET, in the Encoder class. For bonus points, this approach will also work for other encodings.
public static string LimitByteLength(string message, int maxLength)
{
    if (string.IsNullOrEmpty(message) || Encoding.UTF8.GetByteCount(message) <= maxLength)
    {
        return message;
    }

    var encoder = Encoding.UTF8.GetEncoder();
    byte[] buffer = new byte[maxLength];
    char[] messageChars = message.ToCharArray();
    encoder.Convert(
        chars: messageChars,
        charIndex: 0,
        charCount: messageChars.Length,
        bytes: buffer,
        byteIndex: 0,
        byteCount: buffer.Length,
        flush: false,
        charsUsed: out int charsUsed,
        bytesUsed: out int bytesUsed,
        completed: out bool completed);

    // I don't think we can return message.Substring(0, charsUsed)
    // as that's the number of UTF-16 chars, not the number of codepoints
    // (think about surrogate pairs). Therefore I think we need to
    // actually convert bytes back into a new string
    return Encoding.UTF8.GetString(buffer, 0, bytesUsed);
}
If you're using .NET Standard 2.1+, you can simplify it a bit:
public static string LimitByteLength(string message, int maxLength)
{
    if (string.IsNullOrEmpty(message) || Encoding.UTF8.GetByteCount(message) <= maxLength)
    {
        return message;
    }

    var encoder = Encoding.UTF8.GetEncoder();
    byte[] buffer = new byte[maxLength];
    encoder.Convert(message.AsSpan(), buffer.AsSpan(), false, out _, out int bytesUsed, out _);
    return Encoding.UTF8.GetString(buffer, 0, bytesUsed);
}
None of the other answers account for extended grapheme clusters, such as 👩🏽🚒. This is composed of 4 Unicode scalars (👩, 🏽, a zero-width joiner, and 🚒), so you need knowledge of the Unicode standard to avoid splitting it in the middle and producing 👩 or 👩🏽.
In .NET 5 onwards, you can write this as:
public static string LimitByteLength(string message, int maxLength)
{
    if (string.IsNullOrEmpty(message) || Encoding.UTF8.GetByteCount(message) <= maxLength)
    {
        return message;
    }

    var enumerator = StringInfo.GetTextElementEnumerator(message);
    var result = new StringBuilder();
    int lengthBytes = 0;
    while (enumerator.MoveNext())
    {
        lengthBytes += Encoding.UTF8.GetByteCount(enumerator.GetTextElement());
        if (lengthBytes <= maxLength)
        {
            result.Append(enumerator.GetTextElement());
        }
    }

    return result.ToString();
}
(This same code runs on earlier versions of .NET, but due to a bug it won't produce the correct result before .NET 5).
If a UTF-8 byte has a zero-valued high order bit, it's the beginning of a character. If its high order bit is 1, it's in the 'middle' of a character. The ability to detect the beginning of a character was an explicit design goal of UTF-8.
Check out the Description section of the wikipedia article for more detail.
Is there a reason that you need the database column to be declared in terms of bytes? That's the default, but it's not a particularly useful default if the database character set is variable width. I'd strongly prefer declaring the column in terms of characters.
CREATE TABLE length_example (
col1 VARCHAR2( 10 BYTE ),
col2 VARCHAR2( 10 CHAR )
);
This will create a table where COL1 will store 10 bytes of data and COL2 will store 10 characters' worth of data. Character-length semantics make far more sense in a UTF8 database.
Assuming you want all the tables you create to use character length semantics by default, you can set the initialization parameter NLS_LENGTH_SEMANTICS to CHAR. At that point, any tables you create will default to using character length semantics rather than byte length semantics if you don't specify CHAR or BYTE in the field length.
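For example, at the session level (a sketch; the parameter can also be set system-wide):

ALTER SESSION SET NLS_LENGTH_SEMANTICS = CHAR;
CREATE TABLE length_example2 (
    col1 VARCHAR2( 10 ) -- now stores 10 characters, not 10 bytes
);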
Following Oren Trutner's comment, here are two more solutions to the problem.
Here we count the number of bytes to remove from the end of the string, according to each character at the end of the string, so we don't evaluate the entire string in every iteration:
string str = "朣楢琴执执 瑩浻牡楧硰执执獧浻牡楧敬瑦 瀰 絸朣杢执獧扻捡杫潲湵 潣";
int maxBytesLength = 30;
var bytesArr = Encoding.UTF8.GetBytes(str);
int bytesToRemove = 0;
int lastIndexInString = str.Length - 1;
while (bytesArr.Length - bytesToRemove > maxBytesLength)
{
    bytesToRemove += Encoding.UTF8.GetByteCount(new char[] { str[lastIndexInString] });
    --lastIndexInString;
}
string trimmedString = Encoding.UTF8.GetString(bytesArr, 0, bytesArr.Length - bytesToRemove);
// Encoding.UTF8.GetByteCount(trimmedString); // the actual length will be <= maxBytesLength
And an even more efficient (and maintainable) solution:
get the string from the byte array according to the desired length and cut the last character, because it might be corrupted
string str = "朣楢琴执执 瑩浻牡楧硰执执獧浻牡楧敬瑦 瀰 絸朣杢执獧扻捡杫潲湵 潣";
int maxBytesLength = 30;
string trimmedWithDirtyLastChar = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(str), 0, maxBytesLength);
string trimmedString = trimmedWithDirtyLastChar.Substring(0, trimmedWithDirtyLastChar.Length - 1);
The only downside of the second solution is that we might cut off a perfectly fine last character, but since we are already cutting the string, that may be acceptable for the requirements.
Thanks to Shhade, who thought of the second solution.
This is another solution based on binary search:
public string LimitToUTF8ByteLength(string text, int size)
{
    if (size <= 0)
    {
        return string.Empty;
    }

    int maxLength = text.Length;
    int minLength = 0;
    int length = maxLength;

    while (maxLength >= minLength)
    {
        length = (maxLength + minLength) / 2;
        int byteLength = Encoding.UTF8.GetByteCount(text.Substring(0, length));
        if (byteLength > size)
        {
            maxLength = length - 1;
        }
        else if (byteLength < size)
        {
            minLength = length + 1;
        }
        else
        {
            return text.Substring(0, length);
        }
    }

    // Round down the result
    string result = text.Substring(0, length);
    if (size >= Encoding.UTF8.GetByteCount(result))
    {
        return result;
    }
    else
    {
        return text.Substring(0, length - 1);
    }
}
public static string LimitByteLength3(string input, Int32 maxLength)
{
    string result = input;
    int byteCount = Encoding.UTF8.GetByteCount(input);
    if (byteCount > maxLength)
    {
        var byteArray = Encoding.UTF8.GetBytes(input);
        result = Encoding.UTF8.GetString(byteArray, 0, maxLength);
    }
    return result;
}