What is the difference between Convert.ToBase64String(byte[]) and HttpServerUtility.UrlTokenEncode(byte[])?

I'm trying to remove a dependence on System.Web.dll from a Web API project, but have stumbled on a call to HttpServerUtility.UrlTokenEncode(byte[] input) (and its corresponding decode method) that I don't know what to replace with to ensure backwards compatibility. The documentation says that this method
Encodes a byte array into its equivalent string representation using base 64 digits, which is usable for transmission on the URL.
I tried substituting with Convert.ToBase64String(byte[] input) (and its corresponding decode method), which is very similarly described in the docs:
Converts an array of 8-bit unsigned integers to its equivalent string representation that is encoded with base-64 digits.
However, they don't seem to be entirely equivalent; when using Convert.FromBase64String(string input) to decode a string encoded with HttpServerUtility, I get an exception stating
The input is not a valid Base-64 string as it contains a non-base 64 character, more than two padding characters, or an illegal character among the padding characters.
What is the difference between these two conversion utilities? What's the correct way to remove this dependence on System.Web.HttpServerUtility?
Some users have suggested that this is a duplicate of this one, but I disagree. That question is about base-64-encoding a string in a url-safe manner in general, but I need to reproduce the exact behavior of HttpServerUtility but without a dependency on System.Web.

I took DGibbs at their word and Used the Source. It turns out the following happens in the HttpServerUtility methods; a code sketch reproducing this follows the two lists below.
Encoding to Base64
Use System.Convert to convert the input to Base64.
Replace + by - and / by _. Example: Foo+bar/== becomes Foo-bar_==.
Replace the trailing = signs with a single digit giving how many there were. Example: Foo-bar_== becomes Foo-bar_2.
Decoding from Base64
Replace the digit at the end of the string with that many = signs. Example: Foo-bar_2 becomes Foo-bar_==.
Replace - by + and _ by /. Example: Foo-bar_== becomes Foo+bar/==.
Use System.Convert to decode the preprocessed input from Base64.
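Putting that together, here is a minimal sketch of drop-in replacements that avoid System.Web (the names UrlTokenEncode/UrlTokenDecode merely mirror the originals; the empty-input handling is an assumption about that edge case, so verify it against your data):

static string UrlTokenEncode(byte[] input)
{
    if (input == null || input.Length == 0)
        return string.Empty; // assumption: mirror the original's empty-input behavior

    // Step 1: plain Base64.
    string base64 = Convert.ToBase64String(input);

    // Steps 2-3: make it URL-safe and replace trailing '=' with their count.
    int padding = base64.Length - base64.TrimEnd('=').Length;
    return base64.Substring(0, base64.Length - padding)
                 .Replace('+', '-')
                 .Replace('/', '_')
           + padding;
}

static byte[] UrlTokenDecode(string input)
{
    if (string.IsNullOrEmpty(input))
        return new byte[0]; // assumption: treat empty input as an empty payload

    // Reverse the steps: the last character holds the padding count.
    int padding = input[input.Length - 1] - '0';
    string base64 = input.Substring(0, input.Length - 1)
                         .Replace('-', '+')
                         .Replace('_', '/')
                    + new string('=', padding);
    return Convert.FromBase64String(base64);
}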

HttpServerUtility.UrlTokenEncode(byte[] input) produces a URL-safe Base64 string. In Base64 the characters +, / and = are valid, but they are not URL-safe; this method replaces them, whereas Convert.ToBase64String(byte[] input) does not. You can probably drop the reference and do it yourself.
Usually '+' is replaced with '-', '/' with '_', and the padding '=' is simply removed.
The accepted answer here gives a code example: How to achieve Base64 URL safe encoding in C#?

Related

How to Determine Unicode Characters from a UTF-16 String?

I have a string that contains an odd Unicode space character, but I'm not sure what character that is. I understand that in C# a string in memory is encoded using the UTF-16 format. What is a good way to determine which Unicode characters make up the string?
This question was marked as a possible duplicate to
Determine a string's encoding in C#
It's not a duplicate of this question because I'm not asking about what the encoding is. I already know that a string in C# is encoded as UTF-16. I'm just asking for an easy way to determine what the Unicode values are in the string.
BMP characters fit in at most 2 bytes (values 0x0000-0xFFFF), so there's a good bit of coverage there. Characters from the Chinese, Thai, even Mongolian alphabets are there, so if you're not an encoding expert, you might be forgiven if your code only handles BMP characters. But all the same, characters like the one shown at http://www.fileformat.info/info/unicode/char/10330/index.htm won't be correctly handled by code that assumes every character fits into two bytes.
Unicode seems to identify characters as numeric code points. Not all code points actually refer to characters, however, because Unicode has the concept of combining characters (which I don’t know much about). However, each Unicode string, even some invalid ones (e.g., illegal sequence of combining characters), can be thought of as a list of code points (numbers).
In the UTF-16 encoding, each code point is encoded as a 2- or 4-byte sequence. In .NET, a Char roughly corresponds to either a 2-byte UTF-16 sequence or half of a 4-byte UTF-16 sequence. When a Char contains half of a 4-byte sequence, it is considered a "surrogate" because it only has meaning when combined with the other Char it must be kept with. To get started with inspecting your .NET string, you can get .NET to tell you the code points contained in the string, automatically combining surrogate pairs together if necessary. .NET provides Char.ConvertToUtf32, which is described as follows:
Converts the value of a UTF-16 encoded character or surrogate pair at a specified position in a string into a Unicode code point.
The documentation for Char.ConvertToUtf32(String s, Int32 index) states that an ArgumentException is thrown for the following case:
The specified index position contains a surrogate pair, and either the first character in the pair is not a valid high surrogate or the second character in the pair is not a valid low surrogate.
Thus, you can go character by character in a string and find all of the Unicode code points with the help of Char.IsHighSurrogate() and Char.ConvertToUtf32(). When you don’t encounter a high surrogate, the current character fits in one Char and you only need to advance one Char in your string. If you do encounter a high surrogate, the character requires two Char and you need to advance by two:
static IEnumerable<int> GetCodePoints(string s)
{
    for (var i = 0; i < s.Length; i += char.IsHighSurrogate(s[i]) ? 2 : 1)
    {
        yield return char.ConvertToUtf32(s, i);
    }
}
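For example, feeding it a string that contains the U+10330 character linked above (a small usage sketch):

string sample = "a\U00010330b";
foreach (int codePoint in GetCodePoints(sample))
{
    Console.WriteLine("U+{0:X4}", codePoint); // prints U+0061, U+10330, U+0062
}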
When you say “from a UTF-16 String”, that might imply that you have read in a series of bytes formatted as UTF-16. If that is the case, you would need to convert them to a .NET string before passing it to the above method (the UTF-16 encoding is exposed as Encoding.Unicode in .NET):
GetCodePoints(Encoding.Unicode.GetString(myUtf16Blob));
Another note: depending on how you build your String instance, it is possible that it contains an illegal sequence of Char with regard to surrogate pairs. Char.ConvertToUtf32() will throw an exception when it encounters such a sequence. However, I think that Encoding.GetString() will always either return a valid string or throw an exception. So, generally, as long as your String instances come from “good” sources, you needn't worry about Char.ConvertToUtf32() throwing (unless you pass in arbitrary values for the index offset, because your offset might land in the middle of a surrogate pair).

Decoding Base64urlUInt-encoded value

What I am generally trying to do, is to validate an id_token value obtained from an OpenID Connect provider (e.g. Google). The token is signed with the RSA algorithm and the public key is read from the Discovery document (the jwks_uri parameter). For example, Google keys are available here in the JWK format:
{
kty: "RSA",
alg: "RS256",
use: "sig",
kid: "38d516cbe31d4345819b786d4d227e3075df02fc",
n: "4fQxF6dFabDqsz9a9-XgVhDaadTBO4yBZkpUyUKrS98ZtpKIQRMLoph3bK9Cua828wwDZ9HHhUxOcbcUiNDUbubtsDz1AirWpCVRRauxRdRInejbGSqHMbg1bxWYfquKKQwF7WnrrSbgdInUZPv5xcHEjQ6q_Kbcsts1Nnc__8YRdmIGrtdTAcm1Ga8LfwroeyiF-2xn0mtWDnU7rblQI4qaXCwM8Zm-lUrpSUkO6E1RTJ1L0vRx8ieyLLOBzJNwxpIBNFolMK8-DYXDSX0SdR7gslInKCn8Ihd9mpI2QBuT-KFUi88t8TW4LsoWHAwlgXCRGP5cYB4r30NQ1wMiuQ",
e: "AQAB"
}
I am going to use the RSACryptoServiceProvider class for verifying the signature. To initialize it, I have to provide RSAParameters with the Modulus and Exponent values. These values are read from the above JWK as n and e, respectively. According to the specification, these values are Base64urlUInt-encoded values:
The representation of a positive or zero integer value as the base64url encoding of the value's unsigned big-endian representation as an octet sequence. The octet sequence MUST utilize the minimum number of octets needed to represent the value. Zero is represented as BASE64URL(single zero-valued octet), which is "AA".
So, my question is: how do I decode these values to put them into RSAParameters? I tried decoding them as a regular Base64 string (Convert.FromBase64String(modulusRaw)), but this obviously does not work and produces this error:
The input is not a valid Base-64 string as it contains a non-base 64 character, more than two padding characters, or an illegal character among the padding characters.
RFC 7515 defines base64url encoding like this:
Base64 encoding using the URL- and filename-safe character set defined in Section 5 of RFC 4648, with all trailing '=' characters omitted (as permitted by Section 3.2) and without the inclusion of any line breaks, whitespace, or other additional characters. Note that the base64url encoding of the empty octet sequence is the empty string. (See Appendix C for notes on implementing base64url encoding without padding.)
RFC 4648 defines "Base 64 Encoding with URL and Filename Safe Alphabet" as regular base64, but:
The padding may be omitted (as it is here)
Using - instead of + and _ instead of /
So to use regular Convert.FromBase64String, you just need to reverse that process:
static byte[] FromBase64Url(string base64Url)
{
    string padded = base64Url.Length % 4 == 0
        ? base64Url
        : base64Url + "====".Substring(base64Url.Length % 4);
    string base64 = padded.Replace("_", "/").Replace("-", "+");
    return Convert.FromBase64String(base64);
}
It's possible that this code already exists somewhere in the framework, but I'm not aware of it.
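For the original goal, the decoded bytes can then go straight into RSAParameters. A rough sketch, assuming nRaw and eRaw are hypothetical variables holding the n and e members of the JWK (requires using System.Security.Cryptography):

// nRaw and eRaw are placeholders for the "n" and "e" values read from the JWK.
var rsaParameters = new RSAParameters
{
    Modulus = FromBase64Url(nRaw),
    Exponent = FromBase64Url(eRaw)
};

using (var rsa = new RSACryptoServiceProvider())
{
    rsa.ImportParameters(rsaParameters);
    // rsa.VerifyData(...) / rsa.VerifyHash(...) can then be used to check the token signature.
}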
Whoever comes here from Java: there are two methods in java.util.Base64:
getDecoder()
getUrlDecoder()
As you probably assumed, the second one already does all the character replacements for you.

How do I create a string with a surrogate pair inside of it?

I saw this post on Jon Skeet's blog where he talks about string reversing. I wanted to try the example he showed myself, but it seems to work... which leads me to believe that I have no idea how to create a string that contains a surrogate pair which will actually cause the string reversal to fail. How does one actually go about creating a string with a surrogate pair in it so that I can see the failure myself?
The simplest way is to use \U######## where the U is capital and the # marks stand for exactly eight hexadecimal digits. If the value exceeds 0000FFFF hexadecimal, a surrogate pair will be needed:
string myString = "In the game of mahjong \U0001F01C denotes the Four of circles";
You can check myString.Length to see that the one Unicode character occupies two .NET Char values. Note that the char type has a couple of static methods that will help you determine if a char is a part of a surrogate pair.
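For instance, a quick sketch using those helpers on the string above:

string myString = "In the game of mahjong \U0001F01C denotes the Four of circles";
int index = myString.IndexOf('\uD83C');                                // the high surrogate of U+1F01C
Console.WriteLine(char.IsHighSurrogate(myString[index]));              // True
Console.WriteLine(char.IsSurrogatePair(myString, index));              // True
Console.WriteLine(char.ConvertToUtf32(myString, index).ToString("X")); // 1F01C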
If you use a .NET language that does not have something like the \U######## escape sequence, you can use the method ConvertFromUtf32, for example:
string fourCircles = char.ConvertFromUtf32(0x1F01C);
Addition: If your C# source file has an encoding that allows all Unicode characters, like UTF-8, you can just put the character directly in the file (by copy-paste). For example:
string myString = "In the game of mahjong 🀜 denotes the Four of circles";
The character is UTF-8 encoded in the source file (in my example) but will be UTF-16 encoded (surrogate pairs) when the application runs and the string is in memory.
(Not sure if Stack Overflow software handles my mahjong character correctly. Try clicking "edit" to this answer and copy-paste from the text there, if the "funny" character is not here.)
The term "surrogate pair" refers to a means of encoding Unicode characters with high code-points in the UTF-16 encoding scheme (see this page for more information);
In the Unicode character encoding, characters are mapped to values between 0x000000 and 0x10FFFF. Internally, a UTF-16 encoding scheme is used to store strings of Unicode text in which two-byte (16-bit) code sequences are considered. Since two bytes can only contain the range of characters from 0x0000 to 0xFFFF, some additional complexity is used to store values above this range (0x010000 to 0x10FFFF).
This is done using pairs of code points known as surrogates. The surrogate characters are classified in two distinct ranges known as low surrogates and high surrogates, depending on whether they are allowed at the start or the end of the two-code sequence.
Try this yourself:
String surrogate = "abc" + Char.ConvertFromUtf32(Int32.Parse("2A601", NumberStyles.HexNumber)) + "def";
Char[] surrogateArray = surrogate.ToCharArray();
Array.Reverse(surrogateArray);
String surrogateReversed = new String(surrogateArray);
or this, if you want to stick with the blog example:
String surrogate = "Les Mise" + Char.ConvertFromUtf32(Int32.Parse("0301", NumberStyles.HexNumber)) + "rables";
Char[] surrogateArray = surrogate.ToCharArray();
Array.Reverse(surrogateArray);
String surrogateReversed = new String(surrogateArray);
and then check the string values with the debugger. Jon Skeet is damn right... strings and dates seem easy, but they are absolutely NOT.

Using a Regex to clean string versus Base64 Encoded string

I have an extension method that uses Regex.Replace to clean up invalid characters in a user-entered string before it is added to an XML document.
The intent of the regex is to strip out the random high-ASCII characters that occasionally end up in the input when the user pastes text from Microsoft Word, and replace them with a space:
public static string CleanInput(this string inputString)
{
    if (string.IsNullOrEmpty(inputString))
        return string.Empty;

    // Replace invalid characters with a space.
    return Regex.Replace(inputString, @"[^\w\.#-]", " ");
}
As fate would have it, someone is now using this extension method on a string that contains Base64-encoded data.
I believe the regex will leave MOST of the Base64 data unmodified; however, I think it might be changing some of it.
So, knowing that \w in the regex matches [A-Za-z0-9_] and that Base64 uses effectively the same range, should this regex be changing the string or not?
If it is changing the string, why? And how would you change it so that high-ASCII garbage is still cleaned up in regular non-encoded text without mucking up the encoded string?
Base64 also uses +, /, and =.
You can add these to your character class:
[^\w\.#+/=-]
Note that - has to be last in order for it to be a literal hyphen-minus instead of specifying a range.
It may also be worth considering that \w isn't necessarily the same as [A-Za-z0-9_] according to Microsoft.
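Applied to the original extension method, the adjusted character class might look like this (a sketch that keeps the original method name):

public static string CleanInput(this string inputString)
{
    if (string.IsNullOrEmpty(inputString))
        return string.Empty;

    // '+', '/' and '=' are preserved so Base64 payloads survive;
    // '-' stays last in the class so it is a literal hyphen, not a range.
    return Regex.Replace(inputString, @"[^\w\.#+/=-]", " ");
}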

How to recognize if a string contains unicode chars?

I have a string and I want to know if it has Unicode characters inside or not
(i.e., whether it contains only ASCII or not).
How can I achieve that?
Thanks!
If my assumptions are correct, you wish to know whether your string contains any "non-ANSI" characters. You can check this as follows.
// Requires: using System.Linq;
public void test()
{
    const string WithUnicodeCharacter = "a hebrew character:\uFB2F";
    const string WithoutUnicodeCharacter = "an ANSI character:Æ";

    bool hasUnicode;

    // true
    hasUnicode = ContainsUnicodeCharacter(WithUnicodeCharacter);
    Console.WriteLine(hasUnicode);

    // false
    hasUnicode = ContainsUnicodeCharacter(WithoutUnicodeCharacter);
    Console.WriteLine(hasUnicode);
}

public bool ContainsUnicodeCharacter(string input)
{
    const int MaxAnsiCode = 255;
    return input.Any(c => c > MaxAnsiCode);
}
Update
This checks for extended ASCII. If you only check for the true ASCII character range (up to 127), then you could potentially get false positives for extended ASCII characters, which do not denote Unicode. I have alluded to this in my sample.
If a string contains only ASCII characters, a round trip through ASCII encoding (encode to bytes, then decode back) should give back the same string,
so a one-liner check in C# could look like this:
string s1 = "testभारत";
bool isUnicode = System.Text.Encoding.ASCII.GetString(System.Text.Encoding.ASCII.GetBytes(s1)) != s1;
ASCII defines only character codes in the range 0-127. Unicode is explicitly defined so that it overlaps with ASCII in that same range. Thus, if you look at the character codes in your string and it contains anything higher than 127, the string contains Unicode characters that are not ASCII characters.
Note that ASCII covers only the English alphabet. Thus, if you (for whatever reason) need to apply the same approach to strings that might contain accented characters (Spanish text, for example), ASCII is not sufficient and you need to look for another differentiator.
The ANSI character set [*] does extend the ASCII characters with the aforementioned accented Latin characters in the range 128-255. However, Unicode does not overlap with ANSI in that range, so technically a Unicode string might contain characters that are not part of ANSI but have the same character code (specifically in the range 128-159, as you can see from the table I linked to).
As for the actual code to do this, @chibacity's answer should work, although you should modify it to cover strict ASCII, because it won't work for ANSI. A sketch of that variant follows the footnote below.
[*] Also known as Windows-1252 (often loosely called Latin 1)
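A strict-ASCII variant of that check could look like this (a sketch; the name ContainsNonAsciiCharacter is just illustrative):

// Requires: using System.Linq;
public static bool ContainsNonAsciiCharacter(string input)
{
    const int MaxAsciiCode = 127;
    return input.Any(c => c > MaxAsciiCode);
}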
As long as it contains characters, it contains Unicode characters.
From System.String:
Represents text as a series of Unicode characters.
public static bool ContainsUnicodeChars(string text)
{
    return !string.IsNullOrEmpty(text);
}
You normally have to worry about different Unicode encodings when you have to:
Encode a string into a stream of bytes with a particular encoding.
Decode a string from a stream of bytes with a particular encoding.
Once you're into string land though, the encoding that the string was originally represented with, if any, is irrelevant.
Each character in a string is defined by a Unicode scalar value, also called a Unicode code point or the ordinal (numeric) value of the Unicode character. Each code point is encoded by using UTF-16 encoding, and the numeric value of each element of the encoding is represented by a Char object.
Perhaps you might also find these questions relevant:
How can you strip non-ASCII characters from a string? (in C#)
C# Ensure string contains only ASCII
And this article by Jon Skeet: Unicode and .NET
Here is another solution that does not use lambda expressions. It is in VB.NET, but you can convert it easily to C#:
Public Function ContainsUnicode(ByVal inputstr As String) As Boolean
    Dim inputCharArray() As Char = inputstr.ToCharArray()
    For i As Integer = 0 To inputCharArray.Length - 1
        If CInt(AscW(inputCharArray(i))) > 255 Then Return True
    Next
    Return False
End Function
