Generate unique 16 character strings across multiple runs - c#

I have to generate 16-character strings, about 100,000 a month. They should not repeat across multiple runs (one run a month, every month). What is the best method to achieve this? Is using hash functions a good idea?
The string can have A-Z and 0-9 only.
This is to be done using C#.
EDIT: The strings should be random. So, keeping a simple counter is not an option.

Since you're limited to 16 alphanumeric characters, a GUID is probably not an option: it requires the full 128 bits to be unique, and while it can be rendered as a 16 character string, that string will not necessarily fit the alphanumeric constraint.
You could keep a simple counter, take the last 64 bits of an MD5 hash of it, and check for uniqueness each time.
// Parse out hex digits in calling code.
static long NextHash(HashSet<long> hashes, int count)
{
    using (var md5 = System.Security.Cryptography.MD5.Create())
    {
        long l = BitConverter.ToInt64(md5.ComputeHash(IntToArray(count)), 0);
        if (!hashes.Contains(l))
        {
            hashes.Add(l);
            return l;
        }
        return -1; // check this in calling code for failure
    }
}

static byte[] IntToArray(int i)
{
    byte[] bytes = new byte[4];
    for (int j = 0; j < 4; j++)
    {
        bytes[j] = (byte)i;
        i >>= 8;
    }
    return bytes; // this return was missing in the original snippet
}
You could do something similar for GUIDs, but I don't know how likely collisions are when you're only looking at a substring.
MD5 hashes have the advantage of "appearing" more random, if that's at all relevant.
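If you'd rather skip hashing altogether, here is a minimal sketch of the direct approach (the method name and persistence strategy are mine, not from the answers above): draw each character from the 36-symbol alphabet with a cryptographic RNG and keep a set of previously issued strings, persisted between the monthly runs, to guarantee no repeats.

static string NextUniqueString(HashSet<string> issued)
{
    const string alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    using (var rng = System.Security.Cryptography.RandomNumberGenerator.Create())
    {
        var bytes = new byte[16];
        var chars = new char[16];
        while (true)
        {
            rng.GetBytes(bytes);
            for (int i = 0; i < 16; i++)
                chars[i] = alphabet[bytes[i] % alphabet.Length]; // slight modulo bias; harmless for uniqueness
            string candidate = new string(chars);
            if (issued.Add(candidate)) // Add returns false for a duplicate
                return candidate;
        }
    }
}

Sixteen base-36 characters carry about 82 bits, so a collision among a few million strings is astronomically unlikely; the persisted set turns "unlikely" into "guaranteed not to happen".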

You have not specified the language.
In PHP:
http://php.net/manual/en/function.uniqid.php
echo rand(0,999).uniqid();
rand(0,999) = up to 3 random characters
uniqid() = 13 characters (derived from the current time in microseconds, not truly random)

I don't know if this satisfies you, but I came up with something like this:
static List<string> generate(int count)
{
    List<string> strings = new List<string>();
    while (strings.Count < count)
    {
        Guid g = Guid.NewGuid();
        string GuidString = g.ToString();
        GuidString = GuidString.Replace("-", "");
        GuidString = GuidString.Remove(16);          // keep the first 16 hex digits
        GuidString = GuidString.ToUpperInvariant();  // Guid.ToString() is lowercase; the question wants A-Z/0-9
        if (!strings.Contains(GuidString))           // note: List.Contains is O(n); a HashSet<string> would be faster
            strings.Add(GuidString);
    }
    return strings;
}

Related

Encrypt int generating unique string and Decrypt string for getting int again

I need to generate unique strings starting from an int number (id). The length must grow proportionally: for small ids I have to generate unique strings of four characters; for big ids the strings get more complex, growing in size when needed (max 8 characters) in order to preserve uniqueness.
All this procedure must be done with two opposite functions:
from id -> obtain string
from string -> obtain id
Unique strings must be composed by numbers and characters of a specific set (379CDEFHKJLMNPQRTUWXY)
Is there a well-known algorithm to do this? I need to do this in C# or, better, in T-SQL. Ideas are also appreciated.
Edit
I have "simply" the need to encode (and than decode) the number. I've implemented this routines for my alphabet (21 symbols length):
Encode:
public static String BfEncode(long input)
{
    if (input < 0) throw new ArgumentOutOfRangeException("input", input, "input cannot be negative");
    char[] clistarr = BfCharList.ToCharArray();
    var result = new Stack<char>();
    while (input != 0)
    {
        result.Push(clistarr[input % 21]);
        input /= 21;
    }
    return new string(result.ToArray());
}
And decode:
public static Int64 BfDecode(string input)
{
    var reversed = input.ToLower().Reverse();
    long result = 0;
    int pos = 0;
    foreach (char c in reversed)
    {
        result += BfCharList.IndexOf(c.ToString().ToUpper()) * (long)Math.Pow(21, pos);
        pos++;
    }
    return result;
}
I've generated example strings in a loop from 10000 to 10000000 (!). Starting from 10K I can generate strings of 4 digits. After the generation I put all the strings in a list and checked for uniqueness (I did it with a parallel foreach...). At the number 122291 the routine threw an exception because there was a duplicate! Is that possible?
Is base conversion to a custom alphabet not a good solution?
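For what it's worth, base conversion itself is injective (two different numbers cannot produce the same string), so a sequential round-trip check like the sketch below should never report a mismatch. If the parallel check flagged a duplicate, one hypothesis worth testing is the non-thread-safe List<string> being written from inside Parallel.ForEach, rather than the conversion itself. (BfCharList is assumed here to be the question's 21-symbol alphabet.)

const string BfCharList = "379CDEFHKJLMNPQRTUWXY";

static void Main()
{
    for (long id = 10000; id <= 1000000; id++)
    {
        string s = BfEncode(id);
        if (BfDecode(s) != id)
            Console.WriteLine("Round trip failed at " + id + " -> " + s);
    }
    Console.WriteLine("Done.");
}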

Is this the correct way to truncate a hash in C#?

I have a requirement to hash input strings and produce 14 digit decimal numbers as output.
The math I am using tells me I can have, at maximum, a 46-bit unsigned integer: 2^46 - 1 = 70,368,744,177,663 still has 14 decimal digits, while a 47-bit value can overflow into 15.
I am aware that a 46-bit uint means less collision resistance for any potential hash function. However, the number of hashes I am creating keeps the collision probability in an acceptable range.
I would be most grateful if the community could help me verify that my method for truncating a hash to 46 bits is solid. I have a gut feeling that there are optimizations and/or easier ways to do this. My function is as follows (where bitLength is 46 when this function is called):
public static UInt64 GetTruncatedMd5Hash(string input, int bitLength)
{
    var md5Hash = MD5.Create();
    byte[] fullHashBytes = md5Hash.ComputeHash(Encoding.UTF8.GetBytes(input));
    var fullHashBits = new BitArray(fullHashBytes);
    // BitArray stores the LSB of each byte at the lowest index, so reversing...
    ReverseBitArray(fullHashBits); // helper not shown
    // Truncate by copying only the number of bits specified by bitLength.
    // (Note the off-by-one: i < bitLength - 1 copies one bit too few.)
    var truncatedHashBits = new BitArray(bitLength);
    for (int i = 0; i < bitLength - 1; i++)
    {
        truncatedHashBits[i] = fullHashBits[i];
    }
    byte[] truncatedHashBytes = new byte[8];
    truncatedHashBits.CopyTo(truncatedHashBytes, 0);
    return BitConverter.ToUInt64(truncatedHashBytes, 0);
}
Thanks for taking a look at this question. I appreciate any feedback!
With the help of the comments above, I crafted the following solution:
public static UInt64 GetTruncatedMd5Hash(string input, int bitLength)
{
    if (string.IsNullOrWhiteSpace(input)) throw new ArgumentException("input must not be null or whitespace");
    if (bitLength > 64) throw new ArgumentException("bitLength must be <= 64");
    var md5Hash = MD5.Create();
    byte[] fullHashBytes = md5Hash.ComputeHash(Encoding.UTF8.GetBytes(input));
    if (bitLength == 64)
        return BitConverter.ToUInt64(fullHashBytes, 0);
    var bitMask = (1UL << bitLength) - 1UL;
    return BitConverter.ToUInt64(fullHashBytes, 0) & bitMask;
}
It's much tighter (and faster) than what I was trying to do before.
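A quick usage sketch (the example values are mine): with bitLength 46 the mask is 2^46 - 1 = 70,368,744,177,663, which has exactly 14 decimal digits, so the result always fits the stated requirement.

ulong h = GetTruncatedMd5Hash("some input", 46);
Console.WriteLine(h);                 // always <= 70368744177663
Console.WriteLine(h.ToString("D14")); // zero-padded to exactly 14 digits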

Search ReadAllBytes for specific values

I am writing a program that reads '.exe' files and stores their hex values in an array of bytes for comparison with an array containing a series of values. (like a very simple virus scanner)
byte[] buffer = File.ReadAllBytes(currentDirectoryContents[j]);
I have then used BitConverter to create a single string of these values
string hex = BitConverter.ToString(buffer);
The next step is to search this string for a series of values (definitions) and return positive for a match. This is where I am running into problems. My definitions are hex values, created and saved in Notepad as definitions.xyz
string[] definitions = File.ReadAllLines(@"C:\definitions.xyz");
I had been trying to read them into a string array and compare the definition elements of the array with string hex
bool[] test = new bool[currentDirectoryContents.Length];
test[j] = hex.Contains(definitions[i]);
This IS a section from a piece of homework, which is why I am not posting my entire code for the program. I had not used C# before last Friday so am most likely making silly mistakes at this point.
Any advice much appreciated :)
It is pretty unclear exactly what format you use for the definitions. Base64 is a good encoding for a byte[]; you can rapidly convert back and forth with Convert.ToBase64String() and Convert.FromBase64String(). But your question suggests the bytes are encoded in hex. Let's assume it looks like "01020304" for new byte[] { 1, 2, 3, 4 }. Then this helper function converts such a string back to a byte[]:
static byte[] Hex2Bytes(string hex)
{
    if (hex.Length % 2 != 0) throw new ArgumentException();
    var retval = new byte[hex.Length / 2];
    for (int ix = 0; ix < hex.Length; ix += 2)
    {
        retval[ix / 2] = byte.Parse(hex.Substring(ix, 2), System.Globalization.NumberStyles.HexNumber);
    }
    return retval;
}
You can now do a fast pattern search with an algorithm like Boyer-Moore.
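For completeness, a naive byte-level scan looks like the sketch below (Boyer-Moore skips ahead instead of advancing one byte at a time, but the interface is the same; the helper name is mine). You would run each definition through Hex2Bytes first and scan the raw file bytes directly:

static bool ContainsPattern(byte[] data, byte[] pattern)
{
    for (int i = 0; i <= data.Length - pattern.Length; i++)
    {
        int j = 0;
        while (j < pattern.Length && data[i + j] == pattern[j])
            j++;
        if (j == pattern.Length)
            return true; // full match starting at offset i
    }
    return false;
}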
I expect you understand that this is a very inefficient way to do it. But apart from that, you should just do something like this:
bool[] test = new bool[currentDirectoryContents.Length];
for (int i = 0; i < test.Length; i++)
{
    byte[] buffer = File.ReadAllBytes(currentDirectoryContents[i]); // was [j]: the loop variable is i
    string hex = BitConverter.ToString(buffer);
    test[i] = ContainsAny(hex, definitions);
}

bool ContainsAny(string s, string[] values)
{
    foreach (string value in values)
    {
        if (s.Contains(value)) // the original was missing a closing parenthesis here
        {
            return true;
        }
    }
    return false;
}
If you can use LINQ, you can do it like this:
var test = currentDirectoryContents.Select(
file=>definitions.Any(
definition =>
BitConverter.ToString(
File.ReadAllBytes(file)
).Contains(definition)
)
).ToArray();
Also, make sure that your definitions-file is formatted in a way that matches the output of BitConverter.ToString(): upper-case with dashes separating each encoded byte:
12-AB-F0-34
54-AC-FF-01-02

Encrypt a number to another number of the same length

I need a way to take a 12 digit number and encrypt it to a different 12 digit number (no characters other than 0123456789). Then at a later point I need to be able to decrypt the encrypted number back to the original number.
It is important that it isn't obvious whether two encrypted numbers were in sequence originally. So, for instance, if I encrypt 000000000001 it should look totally different when encrypted than 000000000002. It doesn't have to be the most secure thing in the world, but the more secure the better.
I've been looking around a lot but haven't found anything that seems to be a perfect fit. From what I've seen some type of XOR might be the easiest way to go, but I'm not sure how to do this.
Thanks,
Jim
I ended up solving this, thanks to you guys, using "FPE from a prefix cipher" from the Wikipedia page http://en.wikipedia.org/wiki/Format-preserving_encryption. I'll give the basics below, hoping they are helpful to someone in the future.
NOTE - I'm sure any expert will tell you this is a hack. The numbers seemed random and it was secure enough for what I needed, but if security is a big concern, use something else. I'm sure experts can point to holes in what I did. My only reason for posting this is that I would have found it useful during my own search for an answer. Also, only use this in situations where the code can't be decompiled.
I was going to post steps, but it's too much to explain. I'll just post my code. This is proof-of-concept code I still need to clean up, but you'll get the idea. Note my code is specific to a 12 digit number, but adjusting for others should be easy. The max is probably 16 the way I did it.
public static string DoEncrypt(string unencryptedString)
{
    string encryptedString = "";
    unencryptedString = new string(unencryptedString.ToCharArray().Reverse().ToArray());
    foreach (char character in unencryptedString.ToCharArray())
    {
        string randomizationSeed = (encryptedString.Length > 0) ? unencryptedString.Substring(0, encryptedString.Length) : "";
        encryptedString += GetRandomSubstitutionArray(randomizationSeed)[int.Parse(character.ToString())];
    }
    return Shuffle(encryptedString);
}
public static string DoDecrypt(string encryptedString)
{
    // Unshuffle the string first to make processing easier.
    encryptedString = Unshuffle(encryptedString);
    string unencryptedString = "";
    foreach (char character in encryptedString.ToCharArray())
        unencryptedString += GetRandomSubstitutionArray(unencryptedString).IndexOf(int.Parse(character.ToString()));
    // Reverse the string, since it was reversed while encrypting.
    return new string(unencryptedString.ToCharArray().Reverse().ToArray());
}
private static string Shuffle(string unshuffled)
{
    char[] unshuffledCharacters = unshuffled.ToCharArray();
    char[] shuffledCharacters = new char[12];
    shuffledCharacters[0] = unshuffledCharacters[2];
    shuffledCharacters[1] = unshuffledCharacters[7];
    shuffledCharacters[2] = unshuffledCharacters[10];
    shuffledCharacters[3] = unshuffledCharacters[5];
    shuffledCharacters[4] = unshuffledCharacters[3];
    shuffledCharacters[5] = unshuffledCharacters[1];
    shuffledCharacters[6] = unshuffledCharacters[0];
    shuffledCharacters[7] = unshuffledCharacters[4];
    shuffledCharacters[8] = unshuffledCharacters[8];
    shuffledCharacters[9] = unshuffledCharacters[11];
    shuffledCharacters[10] = unshuffledCharacters[6];
    shuffledCharacters[11] = unshuffledCharacters[9];
    return new string(shuffledCharacters);
}
private static string Unshuffle(string shuffled)
{
    char[] shuffledCharacters = shuffled.ToCharArray();
    char[] unshuffledCharacters = new char[12];
    unshuffledCharacters[0] = shuffledCharacters[6];
    unshuffledCharacters[1] = shuffledCharacters[5];
    unshuffledCharacters[2] = shuffledCharacters[0];
    unshuffledCharacters[3] = shuffledCharacters[4];
    unshuffledCharacters[4] = shuffledCharacters[7];
    unshuffledCharacters[5] = shuffledCharacters[3];
    unshuffledCharacters[6] = shuffledCharacters[10];
    unshuffledCharacters[7] = shuffledCharacters[1];
    unshuffledCharacters[8] = shuffledCharacters[8];
    unshuffledCharacters[9] = shuffledCharacters[11];
    unshuffledCharacters[10] = shuffledCharacters[2];
    unshuffledCharacters[11] = shuffledCharacters[9];
    return new string(unshuffledCharacters);
}
public static string DoPrefixCipherEncrypt(string strIn, byte[] btKey)
{
    if (strIn.Length < 1)
        return strIn;
    // Convert the input string to a byte array
    byte[] btToEncrypt = System.Text.Encoding.Unicode.GetBytes(strIn);
    RijndaelManaged cryptoRijndael = new RijndaelManaged();
    cryptoRijndael.Mode = CipherMode.ECB;       // doesn't require an initialization vector
    cryptoRijndael.Padding = PaddingMode.PKCS7;
    // Get an encryptor (no IV needed because we are using ECB mode)
    ICryptoTransform ictEncryptor = cryptoRijndael.CreateEncryptor(btKey, null);
    // Encrypt the data...
    MemoryStream msEncrypt = new MemoryStream();
    CryptoStream csEncrypt = new CryptoStream(msEncrypt, ictEncryptor, CryptoStreamMode.Write);
    // Write all data to the crypto stream to encrypt it
    csEncrypt.Write(btToEncrypt, 0, btToEncrypt.Length);
    csEncrypt.Close(); // flush, close, dispose
    // Get the encrypted array of bytes
    byte[] btEncrypted = msEncrypt.ToArray();
    // Convert the resulting encrypted byte array to a string for return
    return Convert.ToBase64String(btEncrypted);
}
private static List<int> GetRandomSubstitutionArray(string number)
{
    // Pad the number as needed to achieve a longer key length and seed more randomly.
    // NOTE: I didn't want to make that code available and it would take too long to
    // clean, so I'll tell you what I did: I prefixed and postfixed every number seed
    // passed in with some values to make it 16 characters long and get a more unique
    // result. For example:
    //   if (number.Length == 15) number = "Y" + number;
    //   if (number.Length == 14) number = "7" + number + "z";
    //   etc. - hey, I already said this is a hack ;)

    // We pass in the current number as the password to an AES encryption of each of
    // the digits 0-9. This gives a set of values we can sort to get a random order
    // for the digits based on the current state of the number.
    Dictionary<string, int> prefixCipherResults = new Dictionary<string, int>();
    for (int ndx = 0; ndx < 10; ndx++)
        prefixCipherResults.Add(DoPrefixCipherEncrypt(ndx.ToString(), Encoding.UTF8.GetBytes(number)), ndx);

    // Order the results and loop through to build the int array.
    List<int> group = new List<int>();
    foreach (string key in prefixCipherResults.Keys.OrderBy(k => k))
        group.Add(prefixCipherResults[key]);
    return group;
}
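A quick round-trip sketch of the code above (the example value is mine). Note it assumes you fill in the seed-padding step described in the comments of GetRandomSubstitutionArray; without it, the UTF8 bytes of a short seed are not a valid AES key length:

string original = "000000000001";
string encrypted = DoEncrypt(original);   // 12 digits, visually unrelated to the input
string decrypted = DoDecrypt(encrypted);
Console.WriteLine(encrypted);
Console.WriteLine(decrypted == original); // True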
One more way to do simple encryption: subtract each digit from 10.
For example, with the initial number 123456:
10 - 1 = 9
10 - 2 = 8
10 - 3 = 7
etc., and you will get
987654
You can combine it with XOR for somewhat stronger encryption.
You can combine it with XOR for more secure encryption.
What you're talking about is kind of like a one-time pad: a key the same length as the plaintext, with modulo math done on each individual character.
A xor B = C
C xor B = A
or in other words
A xor B xor B = A
As long as you don't use the same key B on multiple different inputs (i.e. B has to be unique every single time you encrypt), then in theory you can never recover the original A without knowing what B was. If you use the same B multiple times, all bets are off.
Comment follow-up:
You shouldn't end up with more bits afterwards than you started with. XOR just flips bits; it doesn't have any carry functionality. Ending up with 6 digits is just odd... As for code:
$plaintext = array(digit1, digit2, digit3, digit4, digit5, digit6);
$key = array(key1, key2, key3, key4, key5, key6);
$ciphertext = array();

# encryption
foreach ($plaintext as $idx => $char) {
    $ciphertext[$idx] = $char xor $key[$idx];
}

# decryption
foreach ($ciphertext as $idx => $char) {
    $decrypted[$idx] = $char xor $key[$idx];
}
Just doing this as an array for simplicity. For actual data you'd work on a per-byte or per-word basis, and just xor each chunk in sequence. You can use a key string shorter than the input, but that makes it easier to reverse engineer the key. In theory, you could use a single byte to do the xor'ing, but then you've just basically achieved the bit-level equivalent of rot-13.
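In C# terms, the per-byte XOR described above might look like this sketch (key management is the hard part and is left out):

// Applying the same function twice restores the original, since A ^ B ^ B == A.
static byte[] XorWithKey(byte[] data, byte[] key)
{
    if (key.Length < data.Length)
        throw new ArgumentException("key must be at least as long as data for one-time-pad use");
    var result = new byte[data.Length];
    for (int i = 0; i < data.Length; i++)
        result[i] = (byte)(data[i] ^ key[i]);
    return result;
}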
For example, you can add the digits of your number to the digits of some constant (214354178963... whatever) and apply the "~" operator (reverse all bits). This is not safe, but it ensures you can always decrypt your number.
Anyone with Reflector or ildasm will be able to hack such an encryption algorithm.
I don't know what your business requirement is, but you have to know that.
If there's enough wriggle-room in the requirements that you can accept 16 hexadecimal digits as the encrypted side, just interpret the 12 digit decimal number as a 64-bit plaintext and use a 64-bit block cipher like Blowfish, Triple-DES, or IDEA.
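A minimal sketch of that idea using DES, a 64-bit block cipher that ships with .NET (Blowfish and IDEA need third-party libraries; all names below are mine). The 12 decimal digits become one 8-byte block, and the encrypted block is rendered as 16 hex digits:

using System;
using System.Security.Cryptography;

static string EncryptTo16HexDigits(string twelveDigits, byte[] key8)
{
    // Interpret the 12 decimal digits as a 64-bit integer -> one 8-byte DES block.
    byte[] block = BitConverter.GetBytes(ulong.Parse(twelveDigits));
    using (var des = DES.Create())
    {
        des.Key = key8;                  // 8-byte key
        des.Mode = CipherMode.ECB;       // a single block, so no IV or chaining needed
        des.Padding = PaddingMode.None;  // input is exactly one block
        using (var enc = des.CreateEncryptor())
        {
            byte[] cipher = enc.TransformFinalBlock(block, 0, block.Length);
            return BitConverter.ToString(cipher).Replace("-", ""); // 16 hex digits
        }
    }
}

Decryption is the mirror image: parse the 16 hex digits back into 8 bytes, run them through CreateDecryptor, and format the resulting ulong with ToString("D12").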

Best way to shorten UTF8 string based on byte length

A recent project called for importing data into an Oracle database. The program that will do this is a C# .Net 3.5 app and I'm using the Oracle.DataAccess connection library to handle the actual inserting.
I ran into a problem where I'd receive this error message when inserting a particular field:
ORA-12899 Value too large for column X
I used Field.Substring(0, MaxLength); but still got the error (though not for every record).
Finally I saw what should have been obvious: my string was in ANSI and the field was UTF8. The field's length is defined in bytes, not characters.
This gets me to my question. What is the best way to trim my string to fit MaxLength?
My substring code works by character length. Is there a simple C# function that can trim a UTF8 string intelligently by byte length (i.e. not hack off half a character)?
I think we can do better than naively counting the total length of the string with each addition. LINQ is cool, but it can accidentally encourage inefficient code. What if I wanted the first 80,000 bytes of a giant UTF8 string? That's a lot of unnecessary counting. "I've got 1 byte. Now I've got 2. Now I've got 13... Now I have 52,384..."
That's silly. Most of the time, at least in English, we can cut exactly on that nth byte. Even in another language, we're less than 6 bytes away from a good cutting point.
So I'm going to start from @Oren's suggestion, which is to key off of the leading bits of a UTF8 byte value. Let's start by cutting right at the (n+1)th byte, and use Oren's trick to figure out if we need to cut a few bytes earlier.
Three possibilities
If the first byte after the cut has a 0 in the leading bit, I know I'm cutting precisely before a single byte (conventional ASCII) character, and can cut cleanly.
If I have an 11 immediately after the cut, the byte after the cut is the first byte of a multi-byte character, so that's a good place to cut too!
If I have a 10, however, I know I'm in the middle of a multi-byte character, and need to go back to check to see where it really starts.
That is, though I want to cut the string after the nth byte, if that n+1th byte comes in the middle of a multi-byte character, cutting would create an invalid UTF8 value. I need to back up until I get to one that starts with 11 and cut just before it.
Code
Notes: I'm using stuff like Convert.ToByte("11000000", 2) so that it's easy to tell which bits I'm masking (a little more about bit masking here). In a nutshell, I'm &ing to return what's in the byte's first two bits and bringing back 0s for the rest. Then I check the XX from XX000000 to see if it's 10 or 11, where appropriate.
C# 7.0 added binary literals (0b...), which is cool, but we'll keep using this kludge for now to illustrate what's going on.
The PadLeft is just because I'm overly OCD about output to the Console.
So here's a function that'll cut you down to a string that's n bytes long, or the greatest length less than n that ends with a "complete" UTF8 character.
public static string CutToUTF8Length(string str, int byteLength)
{
    byte[] byteArray = Encoding.UTF8.GetBytes(str);
    string returnValue = string.Empty;

    if (byteArray.Length > byteLength)
    {
        int bytePointer = byteLength;

        // Check high bit to see if we're [potentially] in the middle of a multi-byte char
        if (bytePointer >= 0
            && (byteArray[bytePointer] & Convert.ToByte("10000000", 2)) > 0)
        {
            // If so, keep walking back until we have a byte starting with `11`,
            // which means the first byte of a multi-byte UTF8 character.
            while (bytePointer >= 0
                && Convert.ToByte("11000000", 2) != (byteArray[bytePointer] & Convert.ToByte("11000000", 2)))
            {
                bytePointer--;
            }
        }

        // See if we had 1s in the high bit all the way back. If so, we're toast. Return empty string.
        if (0 != bytePointer)
        {
            returnValue = Encoding.UTF8.GetString(byteArray, 0, bytePointer); // hat tip to @NealEhardt! Well played. ;^)
        }
    }
    else
    {
        returnValue = str;
    }

    return returnValue;
}
I initially wrote this as a string extension. Just add back the this before string str to put it back into extension format, of course. I removed the this so that we could just slap the method into Program.cs in a simple console app to demonstrate.
Test and expected output
Here's a good test case, with the output it creates below, written as the Main method of a simple console app's Program.cs.
static void Main(string[] args)
{
    string testValue = "12345“”67890”";

    for (int i = 0; i < 15; i++)
    {
        string cutValue = Program.CutToUTF8Length(testValue, i);
        Console.WriteLine(i.ToString().PadLeft(2) +
            ": " + Encoding.UTF8.GetByteCount(cutValue).ToString().PadLeft(2) +
            ":: " + cutValue);
    }

    Console.WriteLine();
    Console.WriteLine();

    foreach (byte b in Encoding.UTF8.GetBytes(testValue))
    {
        Console.WriteLine(b.ToString().PadLeft(3) + " " + (char)b);
    }

    Console.WriteLine("Return to end.");
    Console.ReadLine();
}
Output follows. Notice that the "smart quotes" in testValue are three bytes long in UTF8 (though when we write the chars to the console in ASCII, it outputs dumb quotes). Also note the ?s output for the second and third bytes of each smart quote in the output.
The first five characters of our testValue are single bytes in UTF8, so 0-5 byte values should be 0-5 characters. Then we have a three-byte smart quote, which can't be included in its entirety until 5 + 3 bytes. Sure enough, we see that pop out at the call for 8. Our next smart quote pops out at 8 + 3 = 11, and then we're back to single-byte characters through 14.
0: 0::
1: 1:: 1
2: 2:: 12
3: 3:: 123
4: 4:: 1234
5: 5:: 12345
6: 5:: 12345
7: 5:: 12345
8: 8:: 12345"
9: 8:: 12345"
10: 8:: 12345"
11: 11:: 12345""
12: 12:: 12345""6
13: 13:: 12345""67
14: 14:: 12345""678
49 1
50 2
51 3
52 4
53 5
226 â
128 ?
156 ?
226 â
128 ?
157 ?
54 6
55 7
56 8
57 9
48 0
226 â
128 ?
157 ?
Return to end.
So that's kind of fun, and I'm in just before the question's five year anniversary. Though Oren's description of the bits had a small error, that's exactly the trick you want to use. Thanks for the question; neat.
Here are two possible solutions: a LINQ one-liner processing the input left to right, and a traditional for-loop processing the input right to left. Which processing direction is faster depends on the string length, the allowed byte length, and the number and distribution of multibyte characters, so it is hard to give a general suggestion. The decision between LINQ and traditional code is probably a matter of taste (or maybe speed).
If speed matters, one could accumulate the byte length of each character until reaching the maximum length instead of calculating the byte length of the whole prefix in each iteration. But I am not sure this works, because I don't know UTF-8 encoding well enough; I can imagine that the byte length of a string does not always equal the sum of the byte lengths of all its characters. (Indeed, for code points outside the BMP, which .NET stores as two surrogate chars, encoding each char separately gives a different byte count than encoding the pair together.)
public static String LimitByteLength(String input, Int32 maxLength)
{
    return new String(input
        .TakeWhile((c, i) =>
            Encoding.UTF8.GetByteCount(input.Substring(0, i + 1)) <= maxLength)
        .ToArray());
}

public static String LimitByteLength2(String input, Int32 maxLength)
{
    for (Int32 i = input.Length - 1; i >= 0; i--)
    {
        if (Encoding.UTF8.GetByteCount(input.Substring(0, i + 1)) <= maxLength)
        {
            return input.Substring(0, i + 1);
        }
    }
    return String.Empty;
}
Shorter version of ruffin's answer. Takes advantage of the design of UTF8:
public static string LimitUtf8ByteCount(this string s, int n)
{
    // quick test (we probably won't be trimming most of the time)
    if (Encoding.UTF8.GetByteCount(s) <= n)
        return s;
    // get the bytes
    var a = Encoding.UTF8.GetBytes(s);
    // if we are in the middle of a character (highest two bits are 10)
    if (n > 0 && (a[n] & 0xC0) == 0x80)
    {
        // remove all bytes whose two highest bits are 10
        // and one more (start of multi-byte sequence - highest bits should be 11)
        while (--n > 0 && (a[n] & 0xC0) == 0x80)
            ;
    }
    // convert back to string (with the limit adjusted)
    return Encoding.UTF8.GetString(a, 0, n);
}
All of the other answers appear to miss the fact that this functionality is already built into .NET, in the Encoder class. For bonus points, this approach will also work for other encodings.
public static string LimitByteLength(string message, int maxLength)
{
    if (string.IsNullOrEmpty(message) || Encoding.UTF8.GetByteCount(message) <= maxLength)
    {
        return message;
    }

    var encoder = Encoding.UTF8.GetEncoder();
    byte[] buffer = new byte[maxLength];
    char[] messageChars = message.ToCharArray();
    encoder.Convert(
        chars: messageChars,
        charIndex: 0,
        charCount: messageChars.Length,
        bytes: buffer,
        byteIndex: 0,
        byteCount: buffer.Length,
        flush: false,
        charsUsed: out int charsUsed,
        bytesUsed: out int bytesUsed,
        completed: out bool completed);

    // I don't think we can return message.Substring(0, charsUsed)
    // as that's the number of UTF-16 chars, not the number of codepoints
    // (think about surrogate pairs). Therefore I think we need to
    // actually convert bytes back into a new string.
    return Encoding.UTF8.GetString(buffer, 0, bytesUsed);
}
If you're using .NET Standard 2.1+, you can simplify it a bit:
public static string LimitByteLength(string message, int maxLength)
{
    if (string.IsNullOrEmpty(message) || Encoding.UTF8.GetByteCount(message) <= maxLength)
    {
        return message;
    }

    var encoder = Encoding.UTF8.GetEncoder();
    byte[] buffer = new byte[maxLength];
    encoder.Convert(message.AsSpan(), buffer.AsSpan(), false, out _, out int bytesUsed, out _);
    return Encoding.UTF8.GetString(buffer, 0, bytesUsed);
}
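A quick usage check (the example string is mine; € is three bytes in UTF-8, so "a€cd" is six bytes total):

Console.WriteLine(LimitByteLength("a€cd", 4)); // "a€" - the 'c' no longer fits
Console.WriteLine(LimitByteLength("a€cd", 3)); // "a"  - the three-byte € doesn't fit in the remaining two bytes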
None of the other answers account for extended grapheme clusters, such as 👩🏽‍🚒. This is composed of 4 Unicode scalars (👩, 🏽, a zero-width joiner, and 🚒), so you need knowledge of the Unicode standard to avoid splitting it in the middle and producing 👩 or 👩🏽.
In .NET 5 onwards, you can write this as:
public static string LimitByteLength(string message, int maxLength)
{
    if (string.IsNullOrEmpty(message) || Encoding.UTF8.GetByteCount(message) <= maxLength)
    {
        return message;
    }

    var enumerator = StringInfo.GetTextElementEnumerator(message);
    var result = new StringBuilder();
    int lengthBytes = 0;
    while (enumerator.MoveNext())
    {
        lengthBytes += Encoding.UTF8.GetByteCount(enumerator.GetTextElement());
        if (lengthBytes > maxLength)
        {
            break; // the next text element no longer fits
        }
        result.Append(enumerator.GetTextElement());
    }
    return result.ToString();
}
(This same code runs on earlier versions of .NET, but due to a bug it won't produce the correct result before .NET 5).
If a UTF-8 byte has a zero-valued high order bit, it's the beginning of a character. If its high order bit is 1, it's in the 'middle' of a character. The ability to detect the beginning of a character was an explicit design goal of UTF-8.
Check out the Description section of the wikipedia article for more detail.
Is there a reason that you need the database column to be declared in terms of bytes? That's the default, but it's not a particularly useful default if the database character set is variable width. I'd strongly prefer declaring the column in terms of characters.
CREATE TABLE length_example (
col1 VARCHAR2( 10 BYTE ),
col2 VARCHAR2( 10 CHAR )
);
This will create a table where COL1 will store 10 bytes of data and col2 will store 10 characters worth of data. Character length semantics make far more sense in a UTF8 database.
Assuming you want all the tables you create to use character length semantics by default, you can set the initialization parameter NLS_LENGTH_SEMANTICS to CHAR. At that point, any tables you create will default to using character length semantics rather than byte length semantics if you don't specify CHAR or BYTE in the field length.
Following Oren Trutner's comment, here are two more solutions to the problem:
Here we count the number of bytes to remove from the end of the string, according to each character at the end, so we don't evaluate the entire string in every iteration.
string str = "朣楢琴执执 瑩浻牡楧硰执执獧浻牡楧敬瑦 瀰 絸朣杢执獧扻捡杫潲湵 潣";
int maxBytesLength = 30;
var bytesArr = Encoding.UTF8.GetBytes(str);
int bytesToRemove = 0;
int lastIndexInString = str.Length - 1;
while (bytesArr.Length - bytesToRemove > maxBytesLength)
{
    bytesToRemove += Encoding.UTF8.GetByteCount(new char[] { str[lastIndexInString] });
    --lastIndexInString;
}
string trimmedString = Encoding.UTF8.GetString(bytesArr, 0, bytesArr.Length - bytesToRemove);
// Encoding.UTF8.GetByteCount(trimmedString); // the actual length, will be <= maxBytesLength
And an even more efficient (and maintainable) solution:
Get the string from the byte array according to the desired length and cut the last character, because it might be corrupted:
string str = "朣楢琴执执 瑩浻牡楧硰执执獧浻牡楧敬瑦 瀰 絸朣杢执獧扻捡杫潲湵 潣";
int maxBytesLength = 30;
string trimmedWithDirtyLastChar = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(str), 0, maxBytesLength);
string trimmedString = trimmedWithDirtyLastChar.Substring(0, trimmedWithDirtyLastChar.Length - 1);
The only downside of the second solution is that we might cut a perfectly fine last character, but we are already cutting the string, so that may fit the requirements.
Thanks to Shhade, who thought of the second solution.
This is another solution based on binary search:
public string LimitToUTF8ByteLength(string text, int size)
{
    if (size <= 0)
    {
        return string.Empty;
    }

    int maxLength = text.Length;
    int minLength = 0;
    int length = maxLength;

    while (maxLength >= minLength)
    {
        length = (maxLength + minLength) / 2;
        int byteLength = Encoding.UTF8.GetByteCount(text.Substring(0, length));
        if (byteLength > size)
        {
            maxLength = length - 1;
        }
        else if (byteLength < size)
        {
            minLength = length + 1;
        }
        else
        {
            return text.Substring(0, length);
        }
    }

    // Round down the result
    string result = text.Substring(0, length);
    if (size >= Encoding.UTF8.GetByteCount(result))
    {
        return result;
    }
    else
    {
        return text.Substring(0, length - 1);
    }
}
public static string LimitByteLength3(string input, Int32 maxLength)
{
    string result = input;
    int byteCount = Encoding.UTF8.GetByteCount(input);
    if (byteCount > maxLength)
    {
        var byteArray = Encoding.UTF8.GetBytes(input);
        // Caution: this can cut in the middle of a multi-byte sequence;
        // GetString then yields U+FFFD replacement characters at the end.
        result = Encoding.UTF8.GetString(byteArray, 0, maxLength);
    }
    return result;
}
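For what it's worth, a small demonstration of why that caveat matters (the example string is mine; € is three bytes in UTF-8):

string s = "12€45";
byte[] b = Encoding.UTF8.GetBytes(s);          // 1 + 1 + 3 + 1 + 1 = 7 bytes
string cut = Encoding.UTF8.GetString(b, 0, 4); // slices the € sequence in half
Console.WriteLine(cut);                        // "12" plus U+FFFD replacement character(s)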
