I'm trying to implement SQL Server Vardecimal decompression. Values stored as 3 digits decimals per every 10 bits. But during implementation I found strange behavior of math. Here is simple test I made
private SqlDecimal Test() {
SqlDecimal mantissa = 0;
SqlDecimal sign = -1;
byte exponent = 0x20;
int numDigits = 0;
// -999999999999999999999999999999999.99999
for (int i = 0; i < 13; i++) {
int temp = 999;
//equal to mantissa = mantissa * 1000 + temp;
numDigits += 3;
int pwr = exponent - (numDigits - 1);
mantissa += temp * (SqlDecimal)Math.Pow(10, pwr);
}
return sign * mantissa;
}
First 2 passes are fine, I have
999000000000000000000000000000000
999999000000000000000000000000000
but third have
999999998999999999999980020000000
Is it some bug in C# SqlDecimal math or am I doing something wrong?
This is an issue with how you're constructing the value to add here:
mantissa += temp * (SqlDecimal)Math.Pow(10, pwr);
The problem starts when pwr is 24. You can see this very clearly here:
Console.WriteLine((SqlDecimal) Math.Pow(10, 24));
The output on my box is:
999999999999999980000000
Now I don't know exactly where that's coming from - but it's simplest to remove the floating point arithmetic entirely. While it may not be efficient, this is a simple way of avoiding the problem:
static SqlDecimal PowerOfTen(int power)
{
// Note: only works for non-negative power values at the moment!
// (To handle negative input, divide by 10 on each iteration instead.)
SqlDecimal result = 1;
for (int i = 0; i < power; i++)
{
result = result * 10;
}
return result;
}
If you then change the line to:
mantissa += temp * PowerOfTen(pwr);
then you'll get the results you expect - at least while pwr is greater than zero. It should be easy to fix PowerOfTen to handle negative values as well though.
Update
Amending the below method to just work with Parse and ToString should improve performance for larger numbers (which would be the general use case for these types):
public static SqlDecimal ToSqlDecimal(this BigInteger bigint)
{
return SqlDecimal.Parse(bigint.ToString());
}
This trick also works for the double returned by the original Math.Pow call; so you could just do:
SqlDecimal.Parse(string.Format("{0:0}",Math.Pow(10,24)))
Original Answer
Obviously #JonSkeet's answer's best, as it only involves 24 iterations, vs potentially thousands in my attempt. However, here's an alternate solution, which may help out in other scenarios where you need to convert large integers (i.e. System.Numeric.BigInteger) to SqlDecimal / where performance is less of a concern.
Fiddle Example
//using System.Data.SqlTypes;
//using System.Numerics; //also needs an assembly reference to System.Numerics.dll
public static class BigIntegerExtensions
{
public static SqlDecimal ToSqlDecimal(this BigInteger bigint)
{
SqlDecimal result = 0;
var longMax = (SqlDecimal)long.MaxValue; //cache the converted value to minimise conversions
var longMin = (SqlDecimal)long.MinValue;
while (bigint > long.MaxValue)
{
result += longMax;
bigint -= long.MaxValue;
}
while (bigint < long.MinValue)
{
result += longMin;
bigint -= long.MinValue;
}
return result + (SqlDecimal)(long)bigint;
}
}
For your above use case, you could use this like so (uses the BigInteger.Pow method):
mantissa += temp * BigInteger.Pow(10, pwr).ToSqlDecimal();
I am using BigInteger.Parse(some string) but it takes forever and I'm not even sure if it finishes.
However, I can convert the huge string to a byte array and jam the byte array into a BigInteger constructor in very little time but it munges the original number stored in the string because of the endian issue with BigInteger and byte arrays.
Is there a way to convert the string to a byte array and put the byte array into the BigInteger object while preserving the original number stored in ASCII in the string?
String s = "12345"; // Some huge string, millions of digits.
BigInteger bi = new BigInteger(Encoding.ASCII.GetBytes(s); // very fast but the 12345 is lost
// OR...
BigInteger bi = BigInteger.Parse(s); // Takes forever therefore unuseable.
The byte[] representation of BigInteger has little to do with the ASCII characters. Much like the byte representation of an int has little to do with the ASCII representation of it.
To parse the number, each character must be converted to the digit value, and added to the previously parsed value multiplied by 10. That is probably why it's taking so long, and any version you write will probably not perform better. It has to do something like:
var nr=0;
foreach(var c in "123") nr=nr*10+(c-'0');
Edit
While it is not possible to perform the conversion by just converting to a byte array, the library implementation is slower then it has to be (at least for simple scenarios that do not need internationalization for example). Using the trick suggested by Rudy Velthuis in the comments and not taking into account hex formats or internationalization, I was able to produce a version which for 303104 characters runs ~5 times faster (from 18.2s to 3.75s. For 1 milion digits the fast method takes 47s, long, but it is a huge number):
public static class Helper
{
static BigInteger[] factors = Enumerable.Range(0, 19).Select(i=> BigInteger.Pow(10, i)).ToArray();
public static BigInteger ParseFast(string str)
{
var result = new BigInteger(0);
var n = str.Length;
var hasSgn = str[0] == '-';
int j;
for (var i = hasSgn ? 1 : 0; i < n; i += j - i)
{
long gr = 0;
for (j = i; j < i + 18 && j < n; j++)
{
gr = gr * 10 + (str[j] - '0');
}
result = result * factors[j-i]+ gr;
}
if (hasSgn)
{
result = BigInteger.MinusOne * result;
}
return result;
}
}
What I am looking for is something like PHPs decbin function in C#. That function converts decimals to its representation as a string.
For example, when using decbin(21) it returns 10101 as result.
I found this function which basically does what I want, but maybe there is a better / faster way?
var result = Convert.ToString(number, 2);
– Almost the only use for the (otherwise useless) Convert class.
Most ways will be better and faster than the function that you found. It's not a very good example on how to do the conversion.
The built in method Convert.ToString(num, base) is an obvious choice, but you can easily write a replacement if you need it to work differently.
This is a simple method where you can specify the length of the binary number:
public static string ToBin(int value, int len) {
return (len > 1 ? ToBin(value >> 1, len - 1) : null) + "01"[value & 1];
}
It uses recursion, the first part (before the +) calls itself to create the binary representation of the number except for the last digit, and the second part takes care of the last digit.
Example:
Console.WriteLine(ToBin(42, 8));
Output:
00101010
int toBase = 2;
string binary = Convert.ToString(21, toBase); // "10101"
To have the binary value in (at least) a specified number of digits, padded with zeroes:
string bin = Convert.ToString(1234, 2).PadLeft(16, '0');
The Convert.ToString does the conversion to a binary string.
The PadLeft adds zeroes to fill it up to 16 digits.
This is my answer:
static bool[] Dec2Bin(int value)
{
if (value == 0) return new[] { false };
var n = (int)(Math.Log(value) / Math.Log(2));
var a = new bool[n + 1];
for (var i = n; i >= 0; i--)
{
n = (int)Math.Pow(2, i);
if (n > value) continue;
a[i] = true;
value -= n;
}
Array.Reverse(a);
return a;
}
Using Pow instead of modulo and divide so i think it's faster way.
A recent project called for importing data into an Oracle database. The program that will do this is a C# .Net 3.5 app and I'm using the Oracle.DataAccess connection library to handle the actual inserting.
I ran into a problem where I'd receive this error message when inserting a particular field:
ORA-12899 Value too large for column X
I used Field.Substring(0, MaxLength); but still got the error (though not for every record).
Finally I saw what should have been obvious, my string was in ANSI and the field was UTF8. Its length is defined in bytes, not characters.
This gets me to my question. What is the best way to trim my string to fix the MaxLength?
My substring code works by character length. Is there simple C# function that can trim a UT8 string intelligently by byte length (ie not hack off half a character) ?
I think we can do better than naively counting the total length of a string with each addition. LINQ is cool, but it can accidentally encourage inefficient code. What if I wanted the first 80,000 bytes of a giant UTF string? That's a lot of unnecessary counting. "I've got 1 byte. Now I've got 2. Now I've got 13... Now I have 52,384..."
That's silly. Most of the time, at least in l'anglais, we can cut exactly on that nth byte. Even in another language, we're less than 6 bytes away from a good cutting point.
So I'm going to start from #Oren's suggestion, which is to key off of the leading bit of a UTF8 char value. Let's start by cutting right at the n+1th byte, and use Oren's trick to figure out if we need to cut a few bytes earlier.
Three possibilities
If the first byte after the cut has a 0 in the leading bit, I know I'm cutting precisely before a single byte (conventional ASCII) character, and can cut cleanly.
If I have a 11 following the cut, the next byte after the cut is the start of a multi-byte character, so that's a good place to cut too!
If I have a 10, however, I know I'm in the middle of a multi-byte character, and need to go back to check to see where it really starts.
That is, though I want to cut the string after the nth byte, if that n+1th byte comes in the middle of a multi-byte character, cutting would create an invalid UTF8 value. I need to back up until I get to one that starts with 11 and cut just before it.
Code
Notes: I'm using stuff like Convert.ToByte("11000000", 2) so that it's easy to tell what bits I'm masking (a little more about bit masking here). In a nutshell, I'm &ing to return what's in the byte's first two bits and bringing back 0s for the rest. Then I check the XX from XX000000 to see if it's 10 or 11, where appropriate.
I found out today that C# 6.0 might actually support binary representations, which is cool, but we'll keep using this kludge for now to illustrate what's going on.
The PadLeft is just because I'm overly OCD about output to the Console.
So here's a function that'll cut you down to a string that's n bytes long or the greatest number less than n that's ends with a "complete" UTF8 character.
public static string CutToUTF8Length(string str, int byteLength)
{
byte[] byteArray = Encoding.UTF8.GetBytes(str);
string returnValue = string.Empty;
if (byteArray.Length > byteLength)
{
int bytePointer = byteLength;
// Check high bit to see if we're [potentially] in the middle of a multi-byte char
if (bytePointer >= 0
&& (byteArray[bytePointer] & Convert.ToByte("10000000", 2)) > 0)
{
// If so, keep walking back until we have a byte starting with `11`,
// which means the first byte of a multi-byte UTF8 character.
while (bytePointer >= 0
&& Convert.ToByte("11000000", 2) != (byteArray[bytePointer] & Convert.ToByte("11000000", 2)))
{
bytePointer--;
}
}
// See if we had 1s in the high bit all the way back. If so, we're toast. Return empty string.
if (0 != bytePointer)
{
returnValue = Encoding.UTF8.GetString(byteArray, 0, bytePointer); // hat tip to #NealEhardt! Well played. ;^)
}
}
else
{
returnValue = str;
}
return returnValue;
}
I initially wrote this as a string extension. Just add back the this before string str to put it back into extension format, of course. I removed the this so that we could just slap the method into Program.cs in a simple console app to demonstrate.
Test and expected output
Here's a good test case, with the output it create below, written expecting to be the Main method in a simple console app's Program.cs.
static void Main(string[] args)
{
string testValue = "12345“”67890”";
for (int i = 0; i < 15; i++)
{
string cutValue = Program.CutToUTF8Length(testValue, i);
Console.WriteLine(i.ToString().PadLeft(2) +
": " + Encoding.UTF8.GetByteCount(cutValue).ToString().PadLeft(2) +
":: " + cutValue);
}
Console.WriteLine();
Console.WriteLine();
foreach (byte b in Encoding.UTF8.GetBytes(testValue))
{
Console.WriteLine(b.ToString().PadLeft(3) + " " + (char)b);
}
Console.WriteLine("Return to end.");
Console.ReadLine();
}
Output follows. Notice that the "smart quotes" in testValue are three bytes long in UTF8 (though when we write the chars to the console in ASCII, it outputs dumb quotes). Also note the ?s output for the second and third bytes of each smart quote in the output.
The first five characters of our testValue are single bytes in UTF8, so 0-5 byte values should be 0-5 characters. Then we have a three-byte smart quote, which can't be included in its entirety until 5 + 3 bytes. Sure enough, we see that pop out at the call for 8.Our next smart quote pops out at 8 + 3 = 11, and then we're back to single byte characters through 14.
0: 0::
1: 1:: 1
2: 2:: 12
3: 3:: 123
4: 4:: 1234
5: 5:: 12345
6: 5:: 12345
7: 5:: 12345
8: 8:: 12345"
9: 8:: 12345"
10: 8:: 12345"
11: 11:: 12345""
12: 12:: 12345""6
13: 13:: 12345""67
14: 14:: 12345""678
49 1
50 2
51 3
52 4
53 5
226 â
128 ?
156 ?
226 â
128 ?
157 ?
54 6
55 7
56 8
57 9
48 0
226 â
128 ?
157 ?
Return to end.
So that's kind of fun, and I'm in just before the question's five year anniversary. Though Oren's description of the bits had a small error, that's exactly the trick you want to use. Thanks for the question; neat.
Here are two possible solution - a LINQ one-liner processing the input left to right and a traditional for-loop processing the input from right to left. Which processing direction is faster depends on the string length, the allowed byte length, and the number and distribution of multibyte characters and is hard to give a general suggestion. The decision between LINQ and traditional code I probably a matter of taste (or maybe speed).
If speed matters, one could think about just accumulating the byte length of each character until reaching the maximum length instead of calculating the byte length of the whole string in each iteration. But I am not sure if this will work because I don't know UTF-8 encoding well enough. I could theoreticaly imagine that the byte length of a string does not equal the sum of the byte lengths of all characters.
public static String LimitByteLength(String input, Int32 maxLength)
{
return new String(input
.TakeWhile((c, i) =>
Encoding.UTF8.GetByteCount(input.Substring(0, i + 1)) <= maxLength)
.ToArray());
}
public static String LimitByteLength2(String input, Int32 maxLength)
{
for (Int32 i = input.Length - 1; i >= 0; i--)
{
if (Encoding.UTF8.GetByteCount(input.Substring(0, i + 1)) <= maxLength)
{
return input.Substring(0, i + 1);
}
}
return String.Empty;
}
Shorter version of ruffin's answer. Takes advantage of the design of UTF8:
public static string LimitUtf8ByteCount(this string s, int n)
{
// quick test (we probably won't be trimming most of the time)
if (Encoding.UTF8.GetByteCount(s) <= n)
return s;
// get the bytes
var a = Encoding.UTF8.GetBytes(s);
// if we are in the middle of a character (highest two bits are 10)
if (n > 0 && ( a[n]&0xC0 ) == 0x80)
{
// remove all bytes whose two highest bits are 10
// and one more (start of multi-byte sequence - highest bits should be 11)
while (--n > 0 && ( a[n]&0xC0 ) == 0x80)
;
}
// convert back to string (with the limit adjusted)
return Encoding.UTF8.GetString(a, 0, n);
}
All of the other answers appear to miss the fact that this functionality is already built into .NET, in the Encoder class. For bonus points, this approach will also work for other encodings.
public static string LimitByteLength(string message, int maxLength)
{
if (string.IsNullOrEmpty(message) || Encoding.UTF8.GetByteCount(message) <= maxLength)
{
return message;
}
var encoder = Encoding.UTF8.GetEncoder();
byte[] buffer = new byte[maxLength];
char[] messageChars = message.ToCharArray();
encoder.Convert(
chars: messageChars,
charIndex: 0,
charCount: messageChars.Length,
bytes: buffer,
byteIndex: 0,
byteCount: buffer.Length,
flush: false,
charsUsed: out int charsUsed,
bytesUsed: out int bytesUsed,
completed: out bool completed);
// I don't think we can return message.Substring(0, charsUsed)
// as that's the number of UTF-16 chars, not the number of codepoints
// (think about surrogate pairs). Therefore I think we need to
// actually convert bytes back into a new string
return Encoding.UTF8.GetString(buffer, 0, bytesUsed);
}
If you're using .NET Standard 2.1+, you can simplify it a bit:
public static string LimitByteLength(string message, int maxLength)
{
if (string.IsNullOrEmpty(message) || Encoding.UTF8.GetByteCount(message) <= maxLength)
{
return message;
}
var encoder = Encoding.UTF8.GetEncoder();
byte[] buffer = new byte[maxLength];
encoder.Convert(message.AsSpan(), buffer.AsSpan(), false, out _, out int bytesUsed, out _);
return Encoding.UTF8.GetString(buffer, 0, bytesUsed);
}
None of the other answers account for extended grapheme clusters, such as 👩🏽🚒. This is composed of 4 Unicode scalars (👩, 🏽, a zero-width joiner, and 🚒), so you need knowledge of the Unicode standard to avoid splitting it in the middle and producing 👩 or 👩🏽.
In .NET 5 onwards, you can write this as:
public static string LimitByteLength(string message, int maxLength)
{
if (string.IsNullOrEmpty(message) || Encoding.UTF8.GetByteCount(message) <= maxLength)
{
return message;
}
var enumerator = StringInfo.GetTextElementEnumerator(message);
var result = new StringBuilder();
int lengthBytes = 0;
while (enumerator.MoveNext())
{
lengthBytes += Encoding.UTF8.GetByteCount(enumerator.GetTextElement());
if (lengthBytes <= maxLength)
{
result.Append(enumerator.GetTextElement());
}
}
return result.ToString();
}
(This same code runs on earlier versions of .NET, but due to a bug it won't produce the correct result before .NET 5).
If a UTF-8 byte has a zero-valued high order bit, it's the beginning of a character. If its high order bit is 1, it's in the 'middle' of a character. The ability to detect the beginning of a character was an explicit design goal of UTF-8.
Check out the Description section of the wikipedia article for more detail.
Is there a reason that you need the database column to be declared in terms of bytes? That's the default, but it's not a particularly useful default if the database character set is variable width. I'd strongly prefer declaring the column in terms of characters.
CREATE TABLE length_example (
col1 VARCHAR2( 10 BYTE ),
col2 VARCHAR2( 10 CHAR )
);
This will create a table where COL1 will store 10 bytes of data and col2 will store 10 characters worth of data. Character length semantics make far more sense in a UTF8 database.
Assuming you want all the tables you create to use character length semantics by default, you can set the initialization parameter NLS_LENGTH_SEMANTICS to CHAR. At that point, any tables you create will default to using character length semantics rather than byte length semantics if you don't specify CHAR or BYTE in the field length.
Following Oren Trutner's comment here are two more solutions to the problem:
here we count the number of bytes to remove from the end of the string according to each character at the end of the string, so we don't evaluate the entire string in every iteration.
string str = "朣楢琴执执 瑩浻牡楧硰执执獧浻牡楧敬瑦 瀰 絸朣杢执獧扻捡杫潲湵 潣"
int maxBytesLength = 30;
var bytesArr = Encoding.UTF8.GetBytes(str);
int bytesToRemove = 0;
int lastIndexInString = str.Length -1;
while(bytesArr.Length - bytesToRemove > maxBytesLength)
{
bytesToRemove += Encoding.UTF8.GetByteCount(new char[] {str[lastIndexInString]} );
--lastIndexInString;
}
string trimmedString = Encoding.UTF8.GetString(bytesArr,0,bytesArr.Length - bytesToRemove);
//Encoding.UTF8.GetByteCount(trimmedString);//get the actual length, will be <= 朣楢琴执执 瑩浻牡楧硰执执獧浻牡楧敬瑦 瀰 絸朣杢执獧扻捡杫潲湵 潣潬昣昸昸慢正
And an even more efficient(and maintainable) solution:
get the string from the bytes array according to desired length and cut the last character because it might be corrupted
string str = "朣楢琴执执 瑩浻牡楧硰执执獧浻牡楧敬瑦 瀰 絸朣杢执獧扻捡杫潲湵 潣"
int maxBytesLength = 30;
string trimmedWithDirtyLastChar = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(str),0,maxBytesLength);
string trimmedString = trimmedWithDirtyLastChar.Substring(0,trimmedWithDirtyLastChar.Length - 1);
The only downside with the second solution is that we might cut a perfectly fine last character, but we are already cutting the string, so it might fit with the requirements.
Thanks to Shhade who thought about the second solution
This is another solution based on binary search:
public string LimitToUTF8ByteLength(string text, int size)
{
if (size <= 0)
{
return string.Empty;
}
int maxLength = text.Length;
int minLength = 0;
int length = maxLength;
while (maxLength >= minLength)
{
length = (maxLength + minLength) / 2;
int byteLength = Encoding.UTF8.GetByteCount(text.Substring(0, length));
if (byteLength > size)
{
maxLength = length - 1;
}
else if (byteLength < size)
{
minLength = length + 1;
}
else
{
return text.Substring(0, length);
}
}
// Round down the result
string result = text.Substring(0, length);
if (size >= Encoding.UTF8.GetByteCount(result))
{
return result;
}
else
{
return text.Substring(0, length - 1);
}
}
public static string LimitByteLength3(string input, Int32 maxLenth)
{
string result = input;
int byteCount = Encoding.UTF8.GetByteCount(input);
if (byteCount > maxLenth)
{
var byteArray = Encoding.UTF8.GetBytes(input);
result = Encoding.UTF8.GetString(byteArray, 0, maxLenth);
}
return result;
}
The number is bigger than int & long but can be accomodated in Decimal. But the normal ToString or Convert methods don't work on Decimal.
I believe this will produce the right results where it returns anything, but may reject valid integers. I dare say that can be worked around with a bit of effort though... (Oh, and it will also fail for negative numbers at the moment.)
static string ConvertToHex(decimal d)
{
int[] bits = decimal.GetBits(d);
if (bits[3] != 0) // Sign and exponent
{
throw new ArgumentException();
}
return string.Format("{0:x8}{1:x8}{2:x8}",
(uint)bits[2], (uint)bits[1], (uint)bits[0]);
}
Do it manually!
http://www.permadi.com/tutorial/numDecToHex/
I've got to agree with James - do it manually - but don't use base-16. Use base 2^32, and print 8 hex digits at a time.
I guess one option would be to keep taking chunks off it, and converting individual chunks? A bit of mod/division etc, converting individual fragments...
So: what hex value do you expect?
Here's two approaches... one uses the binary structure of decimal; one does it manually. In reality, you might want to have a test: if bits[3] is zero, do it the quick way, otherwise do it manually.
decimal d = 588063595292424954445828M;
int[] bits = decimal.GetBits(d);
if (bits[3] != 0) throw new InvalidOperationException("Only +ve integers supported!");
string s = Convert.ToString(bits[2], 16).PadLeft(8,'0') // high
+ Convert.ToString(bits[1], 16).PadLeft(8, '0') // middle
+ Convert.ToString(bits[0], 16).PadLeft(8, '0'); // low
Console.WriteLine(s);
/* or Jon's much tidier: string.Format("{0:x8}{1:x8}{2:x8}",
(uint)bits[2], (uint)bits[1], (uint)bits[0]); */
const decimal chunk = (decimal)(1 << 16);
StringBuilder sb = new StringBuilder();
while (d > 0)
{
int fragment = (int) (d % chunk);
sb.Insert(0, Convert.ToString(fragment, 16).PadLeft(4, '0'));
d -= fragment;
d /= chunk;
}
Console.WriteLine(sb);