Convert string of integers to ASCII chars - c#

What is the best way to convert a string of digits into their equivalent ASCII characters?
I think that I am over-complicating this.
Console.WriteLine($"Enter the word to decrypt: ");
//store the values to convert into a string
string vWord = Console.ReadLine();
for (int i = 0; i < vWord.Length; i++)
{
int convertedIndex = vWord[i];
char character = (char)convertedIndex;
finalValue += character.ToString();
Console.WriteLine($"Input: {vWord[i]} Index: {convertedIndex} Char {character}");
}

If the expected input values are something like this: 65 66 67 97 98 99, you could just split the input and cast the converted int values to char:
string vWord = "65 66 67 97 98 99";
string result = string.Join("", vWord.Split().Select(n => (char)(int.Parse(n))));
Console.WriteLine($"Result string: {result}");
This method, however, doesn't perform any error checking on the input string. When dealing with user input, that's not a great idea, so it's better to use int.TryParse() to validate each part:
var result = new StringBuilder();
var ASCIIValues = vWord.Split();
foreach (string CharValue in ASCIIValues) {
if (int.TryParse(CharValue, out int n) && n < 127) {
result.Append((char)n);
}
else {
Console.WriteLine($"{CharValue} is not a vaid input");
break;
}
}
Console.WriteLine($"Result string: {result.ToString()}");
You could also use the Encoding.ASCII.GetString method to convert to a string the byte array generated by byte.Parse. For example, using LINQ's Select:
string vWord = "65 66 67 97 98 267";
try
{
var CharArray = vWord.Split().Select(n => byte.Parse(n)).ToArray();
string result = Encoding.ASCII.GetString(CharArray);
Console.WriteLine($"String result: {result}");
}
catch (Exception)
{
Console.WriteLine("Not a vaid input");
}
This will print "Not a valid input", because one of the values is greater than 255, the maximum value of a byte, so byte.Parse throws.
Should you decide to allow an input string composed of contiguous values:
651016667979899112101 => "AeBCabcpe"
You could adopt this variation:
string vWord2 = "11065666797989911210110177";
int step = 2;
var result2 = new StringBuilder();
for (int i = 0; i < vWord2.Length; i += step)
{
if (int.TryParse(vWord2.Substring(i, step), out int n) && n < 127)
{
if (n <= 12 & i == 0) {
i = -3; step = 3; ;
}
else if(n <= 12 & i >= 2) {
step = 3; i -= step;
}
else {
result2.Append((char)n);
if (step == 3) ++i;
step = 2;
}
}
else {
Console.WriteLine($"{vWord2.Substring(i, step)} is not a vaid input");
break;
}
}
Console.WriteLine($"Result string: {result2.ToString()}");
Result string: nABCabcpeeM
As Tom Blodget requested, a note about the automatic conversion between the ASCII character set and Unicode code points.
This code produces ASCII characters from integer values, each corresponding to a character in the ASCII table, by casting the value to a char type and converting the result to a standard Windows Unicode (UTF-16LE) string.
Why is there no need to explicitly convert the ASCII chars to their Unicode representation?
Because, for historical reasons, the lower Unicode code points directly map to the standard ASCII table (the US-ASCII table).
Hence, no conversion is required, or it can be considered implicit.
But, since the .Net string type uses UTF-16LE Unicode internally (which uses one 16-bit code unit for each character in the lower plane, and two 16-bit code units for code points greater than or equal to 2^16, i.e. U+10000), the memory allocation in bytes for the string is double the number of characters.
In the .Net Reference Source, StringBuilder.ToString() will call the internal wstrcpy method:
wstrcpy(char *dmem, char *smem, int charCount)
which will then call Buffer.Memcpy:
Buffer.Memcpy((byte*)dmem, (byte*)smem, charCount * 2);
where the size in bytes is set to charCount * 2.
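A quick sketch of both points, assuming nothing beyond System and System.Text: the ASCII code of a character is also its Unicode code point, and a .NET string stores two bytes per BMP character:
Console.WriteLine((int)'A');                              // 65: the US-ASCII code is also the code point U+0041
Console.WriteLine((char)65);                              // A
Console.WriteLine(char.ConvertFromUtf32(65));             // A
Console.WriteLine(Encoding.ASCII.GetByteCount("ABC"));    // 3 bytes as ASCII
Console.WriteLine(Encoding.Unicode.GetByteCount("ABC"));  // 6 bytes as UTF-16LE: two per character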
Since the first drafts in the late '80s, when the first Universal Character Set (UCS) was developed, one of the primary objectives of ISO/IEC and the Unicode Consortium (the two main entities developing the standard) was to preserve compatibility with the pre-existing 256-character sets widely used at the time.
Preserving the code point definitions, and thus preserving compatibility over time, is a strict rule in the Unicode world. This concept and these rules apply to all modern Unicode encodings (UTF-8, UTF-16, UTF-16LE, UTF-32, etc.) and to all code points in the Basic Multilingual Plane (the ranges U+0000 to U+D7FF and U+E000 to U+FFFF).
On the other hand, there's no explicit guarantee that the same local code page encoding (often referred to as ANSI encoding) will produce the same result on two machines, even when the same system (and system version) is in use.
See the Unicode Common Locale Data Repository (CLDR) for some other notes about localization.

You can break the problem down into two parts:
P1. You want to take a string input of space-separated numbers, and convert them to int values:
private static int[] NumbersFromString(string input)
{
var parts = input.Split(new string[] { " " }, StringSplitOptions.RemoveEmptyEntries);
var values = new List<int>(parts.Length);
foreach (var part in parts)
{
int value;
if (!int.TryParse(part, out value))
{
throw new ArgumentException("One or more values in the input string are invalid.", "input");
}
values.Add(value);
}
return values.ToArray();
}
P2. You want to convert those numbers into character representations:
private static string AsciiCodesToString(int[] inputValues)
{
var builder = new StringBuilder();
foreach (var value in inputValues)
{
builder.Append((char)value);
}
return builder.ToString();
}
You can then combine them like this:
Console.WriteLine(AsciiCodesToString(NumbersFromString(input)));
Try it online

Related

Compare string representation of a hexadecimal number [duplicate]

What is the best way to compare two hexadecimal numbers (that is, numbers represented as strings)? For instance,
string a = "3F";
string b = "32";
if (a > b)
MessageBox.Show("a is greater");
Should work. (Assuming > has been properly overloaded).
You can always convert them to ints and compare them that way:
int a = int.Parse("3E", System.Globalization.NumberStyles.HexNumber);
int b = int.Parse("32", System.Globalization.NumberStyles.HexNumber);
if (a > b)
MessageBox.Show("a is greater");
Seems safer :)
Convert them to integers and compare the integers.
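A minimal sketch of that advice, using Convert.ToInt32 with base 16 as an alternative to the int.Parse call shown above (it assumes the same WinForms context as the question):
int a = Convert.ToInt32("3F", 16);   // 63
int b = Convert.ToInt32("32", 16);   // 50
if (a > b)
    MessageBox.Show("a is greater");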
There is also a simple algorithm based on string comparison:
Assuming your numbers have a uniform format (consistently lower-case or upper-case letters, either always with or always without a leading 0x, and no leading zeros), you can do it like this:
If number a has more digits than number b: a > b.
If the number of digits is equal, you can use String.Compare.
This algorithm has the advantage that it is not limited to 32 or 64 bits.
Here is a fairly robust implementation of hendrik’s suggestion. There are a number of ways it could be optimized if your input strings have known attributes, but it should be able to compare valid hex strings of any size and/or with mixed formats.
public int HexStringCompare(string value1, string value2)
{
string InvalidHexExp = @"[^\dabcdef]";
string HexPaddingExp = @"^(0x)?0*";
//Remove whitespace, "0x" prefix if present, and leading zeros.
//Also make all characters lower case.
string Value1 = Regex.Replace(value1.Trim().ToLower(), HexPaddingExp, "");
string Value2 = Regex.Replace(value2.Trim().ToLower(), HexPaddingExp, "");
//validate that values contain only hex characters
if (Regex.IsMatch(Value1, InvalidHexExp))
{
throw new ArgumentOutOfRangeException("Value1 is not a hex string");
}
if (Regex.IsMatch(Value2, InvalidHexExp))
{
throw new ArgumentOutOfRangeException("Value2 is not a hex string");
}
int Result = Value1.Length.CompareTo(Value2.Length);
if (Result == 0)
{
Result = Value1.CompareTo(Value2);
}
return Result;
}
Using this to answer the OP's question:
if (HexStringCompare(a, b) > 0)
MessageBox.Show("a is greater");

Emoji conversion to a specific string representation

Currently I'm using a HashSet of Tuples called Emoji to replace emoji with a string representation, so that, for example, the emoji for bomb becomes U0001F4A3. The conversion is done via
Emoji.Aggregate(input, (current, pair) => current.Replace(pair.Item1, pair.Item2));
Works as expected.
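For context, a minimal sketch of what that lookup-based approach looks like; the two entries below are hypothetical samples (the real Emoji set has 2600+ items), and it assumes using System.Linq and System.Collections.Generic:
var Emoji = new HashSet<Tuple<string, string>>
{
    Tuple.Create("\U0001F4A3", "U0001F4A3"),   // bomb
    Tuple.Create("\U0001F3EF", "U0001F3EF")    // Japanese castle, a second sample entry
};
string input = "This string contains the unicode character bomb (\U0001F4A3)";
string output = Emoji.Aggregate(input, (current, pair) => current.Replace(pair.Item1, pair.Item2));
Console.WriteLine(output);   // ... bomb (U0001F4A3)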
However, I'm trying to achieve the same thing without making use of a predefined list of 2600+ items. Has anyone already achieved such a thing, where the emoji in a string are replaced with their escaped counterpart without the leading \?
For example:
"This string contains the unicode character bomb (πŸ’£)"
becomes
"This string contains the unicode character bomb (U0001F4A3)"
It sounds like you're happy to replace any character not in the basic multi-lingual plane with its hex representation. The code to do that is slightly longwinded, but it's pretty simple:
using System;
using System.Text;
class Test
{
static void Main()
{
string text = "This string contains the unicode character bomb (\U0001F4A3)";
Console.WriteLine(ReplaceNonBmpWithHex(text));
}
static string ReplaceNonBmpWithHex(string input)
{
// TODO: If most strings don't have any non-BMP characters, consider
// an optimization of checking for high/low surrogate characters first,
// and return input if there aren't any.
StringBuilder builder = new StringBuilder(input.Length);
for (int i = 0; i < input.Length; i++)
{
char c = input[i];
// A surrogate pair is a high surrogate followed by a low surrogate
if (char.IsHighSurrogate(c))
{
if (i == input.Length -1)
{
throw new ArgumentException($"High surrogate at end of string");
}
// Fetch the low surrogate, advancing our counter
i++;
char d = input[i];
if (!char.IsLowSurrogate(d))
{
throw new ArgumentException($"Unmatched low surrogate at index {i-1}");
}
uint highTranslated = (uint) ((c - 0xd800) * 0x400);
uint lowTranslated = (uint) (d - 0xdc00);
uint utf32 = (uint) (highTranslated + lowTranslated + 0x10000);
builder.AppendFormat("U{0:X8}", utf32);
}
// We should never see a low surrogate on its own
else if (char.IsLowSurrogate(c))
{
throw new ArgumentException($"Unmatched low surrogate at index {i}");
}
// Most common case: BMP character; just append it.
else
{
builder.Append(c);
}
}
return builder.ToString();
}
}
Note that this does not attempt to handle the situation where multiple characters are used together, as per Yury's answer. It would replace each modifier/emoji/secondary-char as a separate UXXXXXXXX part.
I'm afraid you have one false assumption here. An emoji is not just a "special Unicode char". The actual length of a particular emoji can be 4 or more chars in a row. For instance:
the emoji itself
a zero-width joiner
a secondary char (like a graduation cap or a microphone)
a gender modifier (man or woman)
a skin tone modifier (Fitzpatrick Scale)
So you should definitely take that variable length into consideration (see the sketch after the links below).
Examples:
https://emojipedia.org/female-health-worker
https://emojipedia.org/male-farmer-type-1-2
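To see that variable length concretely, a small sketch (the escapes spell out woman + zero-width joiner + fire engine; assumes using System.Globalization and System.Linq):
string firefighter = "\U0001F469\u200D\U0001F692";                   // woman + ZWJ + fire engine
Console.WriteLine(firefighter.Length);                               // 5 UTF-16 code units
Console.WriteLine(firefighter.EnumerateRunes().Count());             // 3 Unicode scalars (.NET Core 3.0+)
Console.WriteLine(new StringInfo(firefighter).LengthInTextElements); // 1 text element on .NET 5+ (more on older runtimes)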

How do I read characters in a string as their UTF-32 decimal values?

I have, for example, this Unicode string, which consists of the Cyclone and the Japanese Castle emoji, defined in C# (.NET uses UTF-16 for its CLR string encoding):
var value = "🌀🏯";
If you check this, you find very quickly that value.Length == 4, because C# uses UTF-16 encoded strings, so I can't just loop over each character and get its UTF-32 decimal value: foreach (var character in value) result = (ulong)character;. This begs the question: how can I get the UTF-32 decimal value for each character in any string?
The Cyclone should be 127744 and the Japanese Castle should be 127983, but I am looking for a general answer that can take any C# string and always produce a UTF-32 decimal value for each character inside it.
I've even tried taking a look at Char.ConvertToUtf32, but this seems to be problematic if, for example:
var value = "a🌀c🏯";
This has a length of 6. So, how do I know when a new character begins? For example:
Char.ConvertToUtf32(value, 0) 97 int
Char.ConvertToUtf32(value, 1) 127744 int
Char.ConvertToUtf32(value, 2) 'Char.ConvertToUtf32(value, 2)' threw an exception of type 'System.ArgumentException' int {System.ArgumentException}
Char.ConvertToUtf32(value, 3) 99 int
Char.ConvertToUtf32(value, 4) 127983 int
Char.ConvertToUtf32(value, 5) 'Char.ConvertToUtf32(value, 5)' threw an exception of type 'System.ArgumentException' int {System.ArgumentException}
There is also the:
public static int ConvertToUtf32(
char highSurrogate,
char lowSurrogate
)
But for me to use this as well I need to figure out when I have surrogate pairs. How can you do that?
Solution 1
string value = "🌀🏯";
byte[] rawUtf32AsBytes = Encoding.UTF32.GetBytes(value);
int[] rawUtf32 = new int[rawUtf32AsBytes.Length / 4];
Buffer.BlockCopy(rawUtf32AsBytes, 0, rawUtf32, 0, rawUtf32AsBytes.Length);
Solution 2
string value = "🌀🏯";
List<int> rawUtf32list = new List<int>();
for (int i = 0; i < value.Length; i++)
{
if (Char.IsHighSurrogate(value[i]))
{
rawUtf32list.Add(Char.ConvertToUtf32(value[i], value[i + 1]));
i++;
}
else
rawUtf32list.Add((int)value[i]);
}
Update:
Starting with .NET Core 3.0 we have the Rune struct, which represents a Unicode scalar value (a UTF-32 code point):
string value = "a🌀c🏯";
var runes = value.EnumerateRunes();
// writes a:97, 🌀:127744, c:99, 🏯:127983
Console.WriteLine(String.Join(", ", runes.Select(r => $"{r}:{r.Value}")));
Here is an extension method that illustrates one way to do it. The idea is that you can loop through each character of the string, and use char.ConvertToUtf32(string, index) to get the unicode value. If the returned value is larger than 0xFFFF, then you know that the unicode value was composed of a set of surrogate characters, and you can adjust the index value accordingly to skip the 2nd surrogate character.
Extension method:
public static IEnumerable<int> GetUnicodeCodePoints(this string s)
{
for (int i = 0; i < s.Length; i++)
{
int unicodeCodePoint = char.ConvertToUtf32(s, i);
if (unicodeCodePoint > 0xffff)
{
i++;
}
yield return unicodeCodePoint;
}
}
Sample usage:
static void Main(string[] args)
{
string s = "a🌀c🏯";
foreach(int unicodeCodePoint in s.GetUnicodeCodePoints())
{
Console.WriteLine(unicodeCodePoint);
}
}

Best way to shorten UTF8 string based on byte length

A recent project called for importing data into an Oracle database. The program that will do this is a C# .Net 3.5 app and I'm using the Oracle.DataAccess connection library to handle the actual inserting.
I ran into a problem where I'd receive this error message when inserting a particular field:
ORA-12899 Value too large for column X
I used Field.Substring(0, MaxLength); but still got the error (though not for every record).
Finally I saw what should have been obvious: my string was in ANSI and the field was UTF8, and its length is defined in bytes, not characters.
This gets me to my question: what is the best way to trim my string to fit the MaxLength?
My substring code works by character length. Is there a simple C# function that can trim a UTF8 string intelligently by byte length (i.e. not hack off half a character)?
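To make the byte/character mismatch concrete, a quick illustration (a sketch only; the value is just an example, and it assumes using System.Text):
string field = "année";                                  // 5 characters
Console.WriteLine(field.Length);                         // 5
Console.WriteLine(Encoding.UTF8.GetByteCount(field));    // 6: 'é' takes two bytes in UTF-8
// Substring(0, maxLength) limits the character count, not the byte count,
// so the result can still overflow a byte-sized column.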
I think we can do better than naively counting the total length of a string with each addition. LINQ is cool, but it can accidentally encourage inefficient code. What if I wanted the first 80,000 bytes of a giant UTF string? That's a lot of unnecessary counting. "I've got 1 byte. Now I've got 2. Now I've got 13... Now I have 52,384..."
That's silly. Most of the time, at least in l'anglais, we can cut exactly on that nth byte. Even in another language, we're less than 6 bytes away from a good cutting point.
So I'm going to start from @Oren's suggestion, which is to key off of the leading bits of a UTF8 byte value. Let's start by cutting right at the n+1th byte, and use Oren's trick to figure out if we need to cut a few bytes earlier.
Three possibilities
If the first byte after the cut has a 0 in the leading bit, I know I'm cutting precisely before a single byte (conventional ASCII) character, and can cut cleanly.
If I have a 11 following the cut, the next byte after the cut is the start of a multi-byte character, so that's a good place to cut too!
If I have a 10, however, I know I'm in the middle of a multi-byte character, and need to go back to check to see where it really starts.
That is, though I want to cut the string after the nth byte, if that n+1th byte comes in the middle of a multi-byte character, cutting would create an invalid UTF8 value. I need to back up until I get to one that starts with 11 and cut just before it.
Code
Notes: I'm using stuff like Convert.ToByte("11000000", 2) so that it's easy to tell what bits I'm masking (a little more about bit masking here). In a nutshell, I'm &ing to return what's in the byte's first two bits and bringing back 0s for the rest. Then I check the XX from XX000000 to see if it's 10 or 11, where appropriate.
I found out today that C# 6.0 might actually support binary representations, which is cool, but we'll keep using this kludge for now to illustrate what's going on.
The PadLeft is just because I'm overly OCD about output to the Console.
So here's a function that'll cut you down to a string that's n bytes long, or the greatest length less than n that ends with a "complete" UTF8 character.
public static string CutToUTF8Length(string str, int byteLength)
{
byte[] byteArray = Encoding.UTF8.GetBytes(str);
string returnValue = string.Empty;
if (byteArray.Length > byteLength)
{
int bytePointer = byteLength;
// Check high bit to see if we're [potentially] in the middle of a multi-byte char
if (bytePointer >= 0
&& (byteArray[bytePointer] & Convert.ToByte("10000000", 2)) > 0)
{
// If so, keep walking back until we have a byte starting with `11`,
// which means the first byte of a multi-byte UTF8 character.
while (bytePointer >= 0
&& Convert.ToByte("11000000", 2) != (byteArray[bytePointer] & Convert.ToByte("11000000", 2)))
{
bytePointer--;
}
}
// See if we had 1s in the high bit all the way back. If so, we're toast. Return empty string.
if (0 != bytePointer)
{
returnValue = Encoding.UTF8.GetString(byteArray, 0, bytePointer); // hat tip to @NealEhardt! Well played. ;^)
}
}
else
{
returnValue = str;
}
return returnValue;
}
I initially wrote this as a string extension. Just add back the this before string str to put it back into extension format, of course. I removed the this so that we could just slap the method into Program.cs in a simple console app to demonstrate.
Test and expected output
Here's a good test case, with the output it creates below; it's written to be the Main method of a simple console app's Program.cs.
static void Main(string[] args)
{
string testValue = "12345“”67890”";
for (int i = 0; i < 15; i++)
{
string cutValue = Program.CutToUTF8Length(testValue, i);
Console.WriteLine(i.ToString().PadLeft(2) +
": " + Encoding.UTF8.GetByteCount(cutValue).ToString().PadLeft(2) +
":: " + cutValue);
}
Console.WriteLine();
Console.WriteLine();
foreach (byte b in Encoding.UTF8.GetBytes(testValue))
{
Console.WriteLine(b.ToString().PadLeft(3) + " " + (char)b);
}
Console.WriteLine("Return to end.");
Console.ReadLine();
}
Output follows. Notice that the "smart quotes" in testValue are three bytes long in UTF8 (though when we write the chars to the console in ASCII, it outputs dumb quotes). Also note the ?s output for the second and third bytes of each smart quote in the output.
The first five characters of our testValue are single bytes in UTF8, so 0-5 byte values should be 0-5 characters. Then we have a three-byte smart quote, which can't be included in its entirety until 5 + 3 bytes. Sure enough, we see that pop out at the call for 8. Our next smart quote pops out at 8 + 3 = 11, and then we're back to single byte characters through 14.
0: 0::
1: 1:: 1
2: 2:: 12
3: 3:: 123
4: 4:: 1234
5: 5:: 12345
6: 5:: 12345
7: 5:: 12345
8: 8:: 12345"
9: 8:: 12345"
10: 8:: 12345"
11: 11:: 12345""
12: 12:: 12345""6
13: 13:: 12345""67
14: 14:: 12345""678
49 1
50 2
51 3
52 4
53 5
226 â
128 ?
156 ?
226 â
128 ?
157 ?
54 6
55 7
56 8
57 9
48 0
226 â
128 ?
157 ?
Return to end.
So that's kind of fun, and I'm in just before the question's five year anniversary. Though Oren's description of the bits had a small error, that's exactly the trick you want to use. Thanks for the question; neat.
Here are two possible solutions: a LINQ one-liner processing the input left to right, and a traditional for-loop processing the input from right to left. Which processing direction is faster depends on the string length, the allowed byte length, and the number and distribution of multibyte characters, so it is hard to give a general suggestion. The decision between LINQ and traditional code is probably a matter of taste (or maybe speed).
If speed matters, one could think about accumulating the byte length of each character until reaching the maximum length, instead of calculating the byte length of the whole string in each iteration (a sketch of that idea follows the two methods below). But I am not sure if this will work, because I don't know UTF-8 encoding well enough; I could theoretically imagine that the byte length of a string does not equal the sum of the byte lengths of all its characters.
public static String LimitByteLength(String input, Int32 maxLength)
{
return new String(input
.TakeWhile((c, i) =>
Encoding.UTF8.GetByteCount(input.Substring(0, i + 1)) <= maxLength)
.ToArray());
}
public static String LimitByteLength2(String input, Int32 maxLength)
{
for (Int32 i = input.Length - 1; i >= 0; i--)
{
if (Encoding.UTF8.GetByteCount(input.Substring(0, i + 1)) <= maxLength)
{
return input.Substring(0, i + 1);
}
}
return String.Empty;
}
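Regarding the accumulation idea above: it should work as long as each surrogate pair is measured as one unit (a lone surrogate has no UTF-8 byte length of its own). A hedged sketch of that approach, not part of the original answer (assumes using System.Text):
public static String LimitByteLengthByAccumulation(String input, Int32 maxLength)
{
    int byteTotal = 0;
    int charCount = 0;
    while (charCount < input.Length)
    {
        // Measure one full code point: a surrogate pair is two chars, anything else is one.
        int step = Char.IsSurrogatePair(input, charCount) ? 2 : 1;
        int bytes = Encoding.UTF8.GetByteCount(input.Substring(charCount, step));
        if (byteTotal + bytes > maxLength)
        {
            break;
        }
        byteTotal += bytes;
        charCount += step;
    }
    return input.Substring(0, charCount);
}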
Shorter version of ruffin's answer. Takes advantage of the design of UTF8:
public static string LimitUtf8ByteCount(this string s, int n)
{
// quick test (we probably won't be trimming most of the time)
if (Encoding.UTF8.GetByteCount(s) <= n)
return s;
// get the bytes
var a = Encoding.UTF8.GetBytes(s);
// if we are in the middle of a character (highest two bits are 10)
if (n > 0 && ( a[n]&0xC0 ) == 0x80)
{
// remove all bytes whose two highest bits are 10
// and one more (start of multi-byte sequence - highest bits should be 11)
while (--n > 0 && ( a[n]&0xC0 ) == 0x80)
;
}
// convert back to string (with the limit adjusted)
return Encoding.UTF8.GetString(a, 0, n);
}
All of the other answers appear to miss the fact that this functionality is already built into .NET, in the Encoder class. For bonus points, this approach will also work for other encodings.
public static string LimitByteLength(string message, int maxLength)
{
if (string.IsNullOrEmpty(message) || Encoding.UTF8.GetByteCount(message) <= maxLength)
{
return message;
}
var encoder = Encoding.UTF8.GetEncoder();
byte[] buffer = new byte[maxLength];
char[] messageChars = message.ToCharArray();
encoder.Convert(
chars: messageChars,
charIndex: 0,
charCount: messageChars.Length,
bytes: buffer,
byteIndex: 0,
byteCount: buffer.Length,
flush: false,
charsUsed: out int charsUsed,
bytesUsed: out int bytesUsed,
completed: out bool completed);
// I don't think we can return message.Substring(0, charsUsed)
// as that's the number of UTF-16 chars, not the number of codepoints
// (think about surrogate pairs). Therefore I think we need to
// actually convert bytes back into a new string
return Encoding.UTF8.GetString(buffer, 0, bytesUsed);
}
If you're using .NET Standard 2.1+, you can simplify it a bit:
public static string LimitByteLength(string message, int maxLength)
{
if (string.IsNullOrEmpty(message) || Encoding.UTF8.GetByteCount(message) <= maxLength)
{
return message;
}
var encoder = Encoding.UTF8.GetEncoder();
byte[] buffer = new byte[maxLength];
encoder.Convert(message.AsSpan(), buffer.AsSpan(), false, out _, out int bytesUsed, out _);
return Encoding.UTF8.GetString(buffer, 0, bytesUsed);
}
None of the other answers account for extended grapheme clusters, such as 👩🏽‍🚒. This is composed of 4 Unicode scalars (👩, 🏽, a zero-width joiner, and 🚒), so you need knowledge of the Unicode standard to avoid splitting it in the middle and producing 👩 or 👩🏽.
In .NET 5 onwards, you can write this as:
public static string LimitByteLength(string message, int maxLength)
{
if (string.IsNullOrEmpty(message) || Encoding.UTF8.GetByteCount(message) <= maxLength)
{
return message;
}
var enumerator = StringInfo.GetTextElementEnumerator(message);
var result = new StringBuilder();
int lengthBytes = 0;
while (enumerator.MoveNext())
{
lengthBytes += Encoding.UTF8.GetByteCount(enumerator.GetTextElement());
if (lengthBytes <= maxLength)
{
result.Append(enumerator.GetTextElement());
}
}
return result.ToString();
}
(This same code runs on earlier versions of .NET, but due to a bug it won't produce the correct result before .NET 5).
If a UTF-8 byte has a zero-valued high order bit, it's the beginning of a character. If its high order bit is 1, it's in the 'middle' of a character. The ability to detect the beginning of a character was an explicit design goal of UTF-8.
Check out the Description section of the wikipedia article for more detail.
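A small illustrative helper sketching that bit test (with the extra detail that a byte starting with binary 11 is the first byte of a multi-byte character, while 10 marks a continuation byte):
static string ClassifyUtf8Byte(byte b)
{
    if ((b & 0x80) == 0x00) return "single-byte (ASCII) character";            // 0xxxxxxx
    if ((b & 0xC0) == 0x80) return "continuation byte, middle of a character"; // 10xxxxxx
    return "first byte of a multi-byte character";                             // 11xxxxxx
}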
Is there a reason that you need the database column to be declared in terms of bytes? That's the default, but it's not a particularly useful default if the database character set is variable width. I'd strongly prefer declaring the column in terms of characters.
CREATE TABLE length_example (
col1 VARCHAR2( 10 BYTE ),
col2 VARCHAR2( 10 CHAR )
);
This will create a table where COL1 can store 10 bytes of data and COL2 can store 10 characters' worth of data. Character length semantics make far more sense in a UTF8 database.
Assuming you want all the tables you create to use character length semantics by default, you can set the initialization parameter NLS_LENGTH_SEMANTICS to CHAR. At that point, any tables you create will default to using character length semantics rather than byte length semantics if you don't specify CHAR or BYTE in the field length.
Following Oren Trutner's comment, here are two more solutions to the problem.
In the first, we count the number of bytes to remove from the end of the string, character by character from the end, so we don't evaluate the entire string in every iteration:
string str = "朣ζ₯’琴执执 瑩桻牑ζ₯§η‘°ζ‰§ζ‰§η§ζ΅»η‰‘ζ₯§ζ•¬η‘¦ η€° η΅Έζœ£ζ’ζ‰§η§ζ‰»ζ‘ζ«ζ½²ζΉ΅ ζ½£"
int maxBytesLength = 30;
var bytesArr = Encoding.UTF8.GetBytes(str);
int bytesToRemove = 0;
int lastIndexInString = str.Length -1;
while(bytesArr.Length - bytesToRemove > maxBytesLength)
{
bytesToRemove += Encoding.UTF8.GetByteCount(new char[] {str[lastIndexInString]} );
--lastIndexInString;
}
string trimmedString = Encoding.UTF8.GetString(bytesArr,0,bytesArr.Length - bytesToRemove);
//Encoding.UTF8.GetByteCount(trimmedString); // the actual length, will be <= maxBytesLength
And an even more efficient (and maintainable) solution:
get the string from the byte array according to the desired length and cut the last character, because it might be corrupted:
string str = "朣ζ₯’琴执执 瑩桻牑ζ₯§η‘°ζ‰§ζ‰§η§ζ΅»η‰‘ζ₯§ζ•¬η‘¦ η€° η΅Έζœ£ζ’ζ‰§η§ζ‰»ζ‘ζ«ζ½²ζΉ΅ ζ½£"
int maxBytesLength = 30;
string trimmedWithDirtyLastChar = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(str),0,maxBytesLength);
string trimmedString = trimmedWithDirtyLastChar.Substring(0,trimmedWithDirtyLastChar.Length - 1);
The only downside with the second solution is that we might cut a perfectly fine last character, but we are already cutting the string, so it may still fit the requirements.
Thanks to Shhade, who thought of the second solution.
This is another solution based on binary search:
public string LimitToUTF8ByteLength(string text, int size)
{
if (size <= 0)
{
return string.Empty;
}
int maxLength = text.Length;
int minLength = 0;
int length = maxLength;
while (maxLength >= minLength)
{
length = (maxLength + minLength) / 2;
int byteLength = Encoding.UTF8.GetByteCount(text.Substring(0, length));
if (byteLength > size)
{
maxLength = length - 1;
}
else if (byteLength < size)
{
minLength = length + 1;
}
else
{
return text.Substring(0, length);
}
}
// Round down the result
string result = text.Substring(0, length);
if (size >= Encoding.UTF8.GetByteCount(result))
{
return result;
}
else
{
return text.Substring(0, length - 1);
}
}
public static string LimitByteLength3(string input, Int32 maxLength)
{
string result = input;
int byteCount = Encoding.UTF8.GetByteCount(input);
if (byteCount > maxLength)
{
var byteArray = Encoding.UTF8.GetBytes(input);
result = Encoding.UTF8.GetString(byteArray, 0, maxLength);
}
return result;
}

How would you get an array of Unicode code points from a .NET String?

I have a list of character range restrictions that I need to check a string against, but the char type in .NET is UTF-16 and therefore some characters become wacky (surrogate) pairs instead. Thus when enumerating all the char's in a string, I don't get the 32-bit Unicode code points and some comparisons with high values fail.
I understand Unicode well enough that I could parse the bytes myself if necessary, but I'm looking for a C#/.NET Framework BCL solution. So ...
How would you convert a string to an array (int[]) of 32-bit Unicode code points?
You are asking about code points. In UTF-16 (C#'s char) there are only two possibilities:
The character is from the Basic Multilingual Plane, and is encoded by a single code unit.
The character is outside the BMP, and encoded using a surrogate high-low pair of code units.
Therefore, assuming the string is valid, this returns an array of code points for a given string:
public static int[] ToCodePoints(string str)
{
if (str == null)
throw new ArgumentNullException("str");
var codePoints = new List<int>(str.Length);
for (int i = 0; i < str.Length; i++)
{
codePoints.Add(Char.ConvertToUtf32(str, i));
if (Char.IsHighSurrogate(str[i]))
i += 1;
}
return codePoints.ToArray();
}
An example with a surrogate pair 🌀 and a composed character ñ:
ToCodePoints("\U0001F300 El Ni\u006E\u0303o"); // 🌀 El Niño
// { 0x1f300, 0x20, 0x45, 0x6c, 0x20, 0x4e, 0x69, 0x6e, 0x303, 0x6f } // 🌀 E l N i n ̃◌ o
Here's another example. These two code points represent a thirty-second note with a staccato accent, both encoded as surrogate pairs:
ToCodePoints("\U0001D162\U0001D181"); // thirty-second note + combining accent-staccato
// { 0x1d162, 0x1d181 }
When normalized (Form C), they are decomposed into a notehead, a combining stem, a combining flag and a combining accent-staccato, all surrogate pairs:
ToCodePoints("\U0001D162\U0001D181".Normalize()); // notehead + stem + flag + accent-staccato
// { 0x1d158, 0x1d165, 0x1d170, 0x1d181 }
Note that leppie's solution is not correct. The question is about code points, not text elements. A text element is a combination of code points that together form a single grapheme. For example, in the example above, the ñ in the string is represented by a Latin lowercase n followed by a combining tilde (U+0303). Leppie's solution discards any combining characters that cannot be normalized into a single code point.
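To illustrate the difference between code points and text elements, a quick sketch using the ToCodePoints method above (assumes using System.Globalization):
string nTilde = "n\u0303";                                        // 'n' + combining tilde
Console.WriteLine(ToCodePoints(nTilde).Length);                   // 2 code points: 0x6E, 0x303
Console.WriteLine(new StringInfo(nTilde).LengthInTextElements);   // 1 text element (grapheme)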
This answer is not correct. See @Virtlink's answer for the correct one.
static int[] ExtractScalars(string s)
{
if (!s.IsNormalized())
{
s = s.Normalize();
}
List<int> chars = new List<int>((s.Length * 3) / 2);
var ee = StringInfo.GetTextElementEnumerator(s);
while (ee.MoveNext())
{
string e = ee.GetTextElement();
chars.Add(char.ConvertToUtf32(e, 0));
}
return chars.ToArray();
}
Notes: Normalization is required to deal with composite characters.
Doesn't seem like it should be much more complicated than this:
public static IEnumerable<int> Utf32CodePoints( this string s )
{
bool useBigEndian = !BitConverter.IsLittleEndian;
Encoding utf32 = new UTF32Encoding( useBigEndian , false , true ) ;
byte[] octets = utf32.GetBytes( s ) ;
for ( int i = 0 ; i < octets.Length ; i+=4 )
{
int codePoint = BitConverter.ToInt32(octets,i);
yield return codePoint;
}
}
I came up with the same approach suggested by Nicholas (and Jeppe), just shorter:
public static IEnumerable<int> GetCodePoints(this string s) {
var utf32 = new UTF32Encoding(!BitConverter.IsLittleEndian, false, true);
var bytes = utf32.GetBytes(s);
return Enumerable.Range(0, bytes.Length / 4).Select(i => BitConverter.ToInt32(bytes, i * 4));
}
The enumeration was all I needed, but getting an array is trivial:
int[] codePoints = myString.GetCodePoints().ToArray();
This solution produces the same results as the solution by Daniel A.A. Pelsmaeker but is a little bit shorter:
public static int[] ToCodePoints(string s)
{
byte[] utf32bytes = Encoding.UTF32.GetBytes(s);
int[] codepoints = new int[utf32bytes.Length / 4];
Buffer.BlockCopy(utf32bytes, 0, codepoints, 0, utf32bytes.Length);
return codepoints;
}
