This question may reveal my ignorance regarding character encoding, so if it does, I would greatly appreciate information to correct that.
I am relaying strings from new applications to an old application. The old application only accepts ASCII characters (http://www.asciitable.com/). The old application also does not support certain characters such as backslashes. The new applications support more or less anything.
Let's say I have the string:
"Whatever - 1_夜_💦💦💦"
I need to convert that to something with only ASCII characters. For example, maybe something like:
"Whatever - 1_\u001cY_=???=???=???"
Then I want to replace the remaining illegal characters with substitution strings.
Ideally, any character that is encoded to ASCII should be decodable. That is, every unique input string should produce a unique output string (two different inputs such as "abc" and "xyz" must never produce the same result), so that an algorithm could convert the output string back to the input string.
This is what I've tried:
static string ConvertToAscii(string str)
{
var return_string = "";
foreach (var c in str)
{
if ((int)c < 128)
{
return_string += c;
}
else
{
var charBytes = BitConverter.GetBytes(c);
var ascii = Encoding.ASCII.GetString(charBytes);
return_string += ascii;
}
}
return return_string;
}
When I use this with the string I mentioned above, I get:
"Whatever - 1_\u001cY_=???=???=???"
That seems great - however, the "\u001cY" is apparently a single character, rather than a collection of ASCII characters. So my target database rejects it, and I am not able to figure out how to remove the "\" while leaving the remaining characters.
How can I convert any string into a collection of ASCII characters?
The easiest approach is to Base64-encode all the bytes, since you don't seem to care how the strings are represented:
Convert.ToBase64String( Encoding.Unicode.GetBytes("Whatever - 1_夜_💦💦💦"))
will produce a result that is guaranteed to be ASCII (even printable ASCII). For your string the result would be "VwBoAGEAdABlAHYAZQByACAALQAgADEAXwAcWV8APdim3D3Yptw92Kbc".
Here is code similar to what I ended up using to convert everything to ASCII:
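Since the question asks for output that can be decoded again, here is a minimal sketch of the full round trip. The class and method names (AsciiRoundTrip, ToAscii, FromAscii) are made up for illustration. Note that Base64 output can contain '+', '/' and '=', so this only fits if the old application accepts those characters.

```csharp
using System;
using System.Text;

public static class AsciiRoundTrip
{
    // Encode any string as printable ASCII: UTF-16LE bytes, then Base64.
    public static string ToAscii(string value)
    {
        return Convert.ToBase64String(Encoding.Unicode.GetBytes(value));
    }

    // Reverse the step above: Base64 text back to bytes, bytes back to a string.
    public static string FromAscii(string ascii)
    {
        return Encoding.Unicode.GetString(Convert.FromBase64String(ascii));
    }
}
```

Because Base64 is a one-to-one mapping of the underlying bytes, two different input strings can never produce the same output.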
internal static string ConvertToAscii(string str)
{
var returnStringBuilder = new StringBuilder();
foreach (var c in str)
{
if (char.IsControl(c))
{
// Control character
continue;
}
if (c < 127)
{
// ASCII Character
returnStringBuilder.Append(c);
}
else
{
returnStringBuilder.Append("U+" + ((int) c).ToString("X4"));
}
}
return returnStringBuilder.ToString();
}
We have the encoding method below that accepts an input string from a legacy system file (in the Unicode format of a VB6 string) and the name of an encoding. It applies the encoding and returns a string that displays correctly in our newer web applications. As our newer applications have a reporting backend that still depends on the old formats I have a need to reverse the encoding to allow newly translated strings to be stored in the legacy files. Here are two examples of conversions done by the Encode method.
Encode("µn¤J¦WºÙ", "BIG5") returns 登入名稱
Encode("Çàðåãèñòðèðîâàííîå èìÿ", "windows-1251") returns Зарегистрированное имя
To reverse these encodings I have been trying various encoding steps based on questions found here and elsewhere but have thus far succeeded only in producing output that appears identical to the input, is composed entirely of question marks or is composed of a mixture of ASCII characters and question marks different to the original input.
The encoding method was written by a departed colleague and I must admit that I don't fully understand why it has the coded loops, while all other examples I find simply get the characters from the string, then the bytes from those using the input encoding and finally get the string from the bytes using the output encoding. If I try removing the coded loops and just doing those three steps the method no longer returns the expected result.
Here is the encoding method, and my question is, how can I create a corresponding Decode method that reverses what it does?
private static string Encode(string src, string encoding)
{
if (String.IsNullOrWhiteSpace(encoding)) return src;
Encoding unicode = Encoding.Unicode;
Encoding sourceEncoding = Encoding.GetEncoding(encoding);
char[] srcChars = src.ToCharArray();
byte[] srcBytes = sourceEncoding.GetBytes(srcChars);
if (srcChars.Length == srcBytes.Length)
{
for (int i = 0; i < srcChars.Length; i++)
if ((int)srcChars[i] < 256)
srcBytes[i] = (byte)srcChars[i];
}
else
{
srcBytes = new byte[srcChars.Length];
for (int i = 0; i < srcChars.Length; i++)
srcBytes[i] = (byte)srcChars[i];
}
byte[] unicodeBytes = Encoding.Convert(sourceEncoding, unicode, srcBytes);
return unicode.GetString(unicodeBytes);
}
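For what it's worth, here is a hedged sketch of what a reverse method could look like. It assumes that every character of the legacy string was below 256, so that Encode simply treated each character as one raw byte (that is what both examples above suggest). Under that assumption the reverse is: get the bytes of the translated text in the target encoding, then widen each byte to one char. The names LegacyText and Decode are hypothetical. On .NET Core and .NET 5+, code pages such as BIG5 and windows-1251 are only available after registering CodePagesEncodingProvider.

```csharp
using System.Text;

public static class LegacyText
{
    // Hypothetical inverse of Encode: turn displayable text back into the
    // "one char per raw byte" form that the legacy files appear to use.
    public static string Decode(string text, string encoding)
    {
        if (string.IsNullOrWhiteSpace(encoding)) return text;

        // Bytes of the text in the legacy code page (for example BIG5).
        byte[] bytes = Encoding.GetEncoding(encoding).GetBytes(text);

        // Widen every byte to a char with the same numeric value.
        var chars = new char[bytes.Length];
        for (int i = 0; i < bytes.Length; i++)
        {
            chars[i] = (char)bytes[i];
        }
        return new string(chars);
    }
}
```

Characters that the target code page cannot represent will come out as '?', so this is only lossless for text that fits the chosen encoding.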
I have posted a few questions about tokens and password resets and have managed to finally figure this all out. Thanks everyone!
So before reading that certain characters will not work in a query string, I decided to hash the query string but as you've guessed, the plus signs are stripped out.
How do you secure or hash a query string?
This is a sample from a company email I received and the string looks like this:
AweVZe-LujIAuh8i9HiXMCNDIRXfSZYv14o4KX0KywJAGlLklGC1hSw-bJWCYfia-pkBbessPNKtQQ&t=pr&ifl
In my setup, I am simply using a GUID. But does it matter?
In my scenario the user cannot access the password page, even without a GUID. That's because the page is set to redirect on load if the query string doesn't match the session variable.
Are there ways to handle query string to give the result like above?
This question is more about acquiring knowledge.
UPDATE:
Here is the Hash Code:
public static string QueryStringHash(string input)
{
byte[] inputBytes = Encoding.UTF8.GetBytes(input);
SHA512Managed sha512 = new SHA512Managed();
byte[] outputBytes = sha512.ComputeHash(inputBytes);
return Convert.ToBase64String(outputBytes);
}
Then I pass the HASH (UserID) to a SESSION before sending it as a query string:
On the next page, the Session hash is not the same as the query string value, so the values do not match and the query string is treated as invalid.
Note: I created a Class called Encryption that handles all the Hash and Encryption.
Session["QueryString"] = Encryption.QueryStringHash(UserID);
Response.Redirect("~/public/reset-password.aspx?uprl=" +
HttpUtility.UrlEncode(Session["QueryString"].ToString()));
I also tried everything mentioned on this page but no luck:
How do I replace all the spaces with %20 in C#
Thanks for reading.
The problem is that base64 encoding uses the '+' and '/' characters, which have special meaning in URLs. If you want to base64 encode query parameters, you have to change those characters. Typically, that's done by replacing the '+' and '/' with '-' and '_' (dash and underscore), respectively, as specified in RFC 4648.
In your code, then, you'd do this:
public static string QueryStringHash(string input)
{
byte[] inputBytes = Encoding.UTF8.GetBytes(input);
SHA512Managed sha512 = new SHA512Managed();
byte[] outputBytes = sha512.ComputeHash(inputBytes);
string b64 = Convert.ToBase64String(outputBytes);
b64 = b64.Replace('+', '-');
return b64.Replace('/', '_');
}
On the receiving end, of course, you'll need to replace the '-' and '_' with the corresponding '+' and '/' before calling the method to convert from base 64.
The RFC allows you to leave out the pad character ('=') in this situation, but if you do send it, it should be URL encoded. There's no need to communicate the pad character if you always know how long your encoded strings are. You can add the required pad characters on the receiving end. But if you can have variable length strings, then you'll need the pad character.
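A rough sketch of both directions, including putting the padding back on the receiving end, might look like this. The class name UrlSafeBase64 and its method names are made up for the example. It relies on the fact that a padded Base64 string always has a length that is a multiple of 4, so the missing '=' characters can be worked out from the length alone.

```csharp
using System;

public static class UrlSafeBase64
{
    // Standard Base64, then drop the padding and swap the two URL-unsafe characters.
    public static string Encode(byte[] data)
    {
        return Convert.ToBase64String(data)
            .TrimEnd('=')
            .Replace('+', '-')
            .Replace('/', '_');
    }

    // Swap the characters back, restore the padding, then decode as usual.
    public static byte[] Decode(string text)
    {
        string b64 = text.Replace('-', '+').Replace('_', '/');
        switch (b64.Length % 4)
        {
            case 2: b64 += "=="; break;
            case 3: b64 += "="; break;
        }
        // A remainder of 1 is never valid; FromBase64String throws FormatException.
        return Convert.FromBase64String(b64);
    }
}
```

With this, the sending page would put UrlSafeBase64.Encode(outputBytes) in the query string and no further URL encoding is needed for the value.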
Any time you see base 64 encoding used in query parameters, this is how it's done. It's all over the place, perhaps most commonly in YouTube video IDs.
I did something before where I had to pass a hash in a query string. As you've experienced, Base64 can be pretty nasty when mixed with URLs, so I decided to pass it as a hex string instead. It's a little longer, but much easier to deal with. Here is how I did it:
First a method to transform binary into a hex string.
private static string GetHexFromData(byte[] bytes)
{
var output = new StringBuilder();
foreach (var b in bytes)
{
output.Append(b.ToString("X2"));
}
return output.ToString();
}
Then a reverse to convert a hex string back to binary.
private static byte[] GetDataFromHex(string hex)
{
var bytes = new List<byte>();
for (int i = 0; i < hex.Length; i += 2)
{
bytes.Add((byte)int.Parse(hex.Substring(i, 2), System.Globalization.NumberStyles.HexNumber));
}
return bytes.ToArray();
}
Alternatively, if you just need to verify that the hashes are the same, convert both to hex strings and compare the strings (case-insensitively). Hope this helps.
I connect to a webservice that gives me a response something like this(This is not the whole string, but you get the idea):
sResponse = "{\"Name\":\" Bod\u00f8\",\"homePage\":\"http:\/\/www.example.com\"}";
As you can see, the "Bod\u00f8" is not as it should be.
Therefore I tried to convert the Unicode escape (\u00f8) to a char by doing this with the string:
public string unicodeToChar(string sString)
{
StringBuilder sb = new StringBuilder();
foreach (char chars in sString)
{
if (chars >= 32 && chars <= 255)
{
sb.Append(chars);
}
else
{
// Replacement character
sb.Append((char)chars);
}
}
sString = sb.ToString();
return sString;
}
But it won't work, probably because the string is shown as \\u00f8, and not \u00f8.
Now it would not be a problem if \u00f8 was the only Unicode escape I had to convert, but I get many more of them.
That means that I can't just use the replace function :(
Hope someone can help.
You're basically talking about converting from JSON (JavaScript Object Notation). Try this link--near the bottom you'll see a list of publicly available libraries, including some in C#, that might do what you need.
The excellent Json.NET library has no problems decoding unicode escape sequences:
var sResponse = "{\"Name\":\"Bod\u00f8\",\"homePage\":\"http://www.ex.com\"}";
var obj = (JObject)JsonConvert.DeserializeObject(sResponse);
var name = ((JValue)obj["Name"]).Value;
var homePage = ((JValue)obj["homePage"]).Value;
Debug.Assert(Equals(name, "Bodø"));
Debug.Assert(Equals(homePage, "http://www.ex.com"));
This also allows you to deserialize to real POCO objects, making the code even cleaner (although less dynamic).
var obj2 = JsonConvert.DeserializeObject<Response>(sResponse);
Debug.Assert(obj2.Name == "Bodø");
Debug.Assert(obj2.HomePage == "http://www.ex.com");
public class Response
{
public string Name { get; set; }
public string HomePage { get; set; }
}
Perhaps you want to try:
string character = Encoding.UTF8.GetString(chars);
sb.Append(character);
I know this question is getting quite old, but I crashed into this problem as of today, while trying to access the Facebook Graph API. I was getting these strange \u00f8 and other variations back.
First I tried a simple replace as the OP also said (with the help from an online table). But I thought "no way!" after adding 2 replaces.
So after looking a little more at the "codes" it suddenly hit me...
The "\u" is a prefix, and the 4 characters after that is a hexadecimal encoded char code! So writing a simple regex to find all \u with 4 alphanumerical characters after, and afterwards converting the last 4 characters to integer and then to a character made the deal.
My source is in VB.NET
Private Function DecodeJsonString(ByVal Input As String) As String
For Each m As System.Text.RegularExpressions.Match In New System.Text.RegularExpressions.Regex("\\u(\w{4})").Matches(Input)
Input = Input.Replace(m.Value, ChrW(CInt("&H" & m.Value.Substring(2))))
Next
Return Input
End Function
I also have a C# version here
private string DecodeJsonString(string Input)
{
foreach (System.Text.RegularExpressions.Match m in new System.Text.RegularExpressions.Regex(@"\\u(\w{4})").Matches(Input))
{
Input = Input.Replace(m.Value, ((char)(System.Int32.Parse(m.Value.Substring(2), System.Globalization.NumberStyles.AllowHexSpecifier))).ToString());
}
return Input;
}
I hope it can help someone out... I hate to add libraries when I really only need a few functions from them!
In C#, I have a string that I'm obtaining from WebClient.DownloadString. I've tried setting client.Encoding to new UTF8Encoding(false), but that's made no difference - I still end up with a byte order mark for UTF-8 at the beginning of the result string. I need to remove this (to parse the resulting XML with LINQ), and want to do so in memory.
So I have a string that starts with \x00EF\x00BB\x00BF, and I want to remove that if it exists. Right now I'm using
if (xml.StartsWith(ByteOrderMarkUtf8))
{
xml = xml.Remove(0, ByteOrderMarkUtf8.Length);
}
but that just feels wrong. I've tried all sorts of code with streams, GetBytes, and encodings, and nothing works. Can anyone provide the "right" algorithm to strip a BOM from a string?
I recently had issues with this after upgrading to .NET 4. Up to and including .NET 3.5, the simple answer is that
String.Trim()
removes the BOM.
However, in .NET 4 you need to change it slightly:
String.Trim(new char[]{'\uFEFF'});
That will also get rid of the byte order mark, though you may also want to remove the ZERO WIDTH SPACE (U+200B):
String.Trim(new char[]{'\uFEFF','\u200B'});
This you could also use to remove other unwanted characters.
Some further information is from
String.Trim Method:
The .NET Framework 3.5 SP1 and earlier versions maintain an internal list of white-space characters that this method trims. Starting with the .NET Framework 4, the method trims all Unicode white-space characters (that is, characters that produce a true return value when they are passed to the Char.IsWhiteSpace method). Because of this change, the Trim method in the .NET Framework 3.5 SP1 and earlier versions removes two characters, ZERO WIDTH SPACE (U+200B) and ZERO WIDTH NO-BREAK SPACE (U+FEFF), that the Trim method in the .NET Framework 4 and later versions does not remove. In addition, the Trim method in the .NET Framework 3.5 SP1 and earlier versions does not trim three Unicode white-space characters: MONGOLIAN VOWEL SEPARATOR (U+180E), NARROW NO-BREAK SPACE (U+202F), and MEDIUM MATHEMATICAL SPACE (U+205F).
I had some incorrect test data, which caused me some confusion. Based on How to avoid tripping over UTF-8 BOM when reading files I found that this worked:
private readonly string _byteOrderMarkUtf8 =
Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());
public string GetXmlResponse(Uri resource)
{
string xml;
using (var client = new WebClient())
{
client.Encoding = Encoding.UTF8;
xml = client.DownloadString(resource);
}
if (xml.StartsWith(_byteOrderMarkUtf8, StringComparison.Ordinal))
{
xml = xml.Remove(0, _byteOrderMarkUtf8.Length);
}
return xml;
}
Setting the client Encoding property correctly reduces the BOM to a single character. However, XDocument.Parse still will not read that string. This is the cleanest version I've come up with to date.
This works as well
int index = xmlResponse.IndexOf('<');
if (index > 0)
{
xmlResponse = xmlResponse.Substring(index, xmlResponse.Length - index);
}
A quick and simple method to remove it directly from a string:
private static string RemoveBom(string p)
{
string BOMMarkUtf8 = Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());
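// Ordinal comparison matters: a culture-sensitive StartsWith ignores the BOM character and can report a match for any string
if (p.StartsWith(BOMMarkUtf8, StringComparison.Ordinal))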
p = p.Remove(0, BOMMarkUtf8.Length);
return p.Replace("\0", "");
}
How to use it:
string yourCleanString=RemoveBom(yourBOMString);
If the variable xml is of type string, you did something wrong already - in a character string, the BOM should not be represented as three separate characters, but as a single code point.
Instead of using DownloadString, use DownloadData, and parse byte arrays instead. The XML parser should recognize the BOM itself, and skip it (except for auto-detecting the document encoding as UTF-8).
I had a very similar problem (I needed to parse an XML document represented as a byte array that had a byte order mark at the beginning of it). I used one of Martin's comments on his answer to come to a solution. I took the byte array I had (instead of converting it to a string) and created a MemoryStream object with it. Then I passed it to XDocument.Load, which worked like a charm. For example, let's say that xmlBytes contains your XML in UTF-8 encoding with a byte order mark at the beginning of it. Then, this would be the code to solve the problem:
var stream = new MemoryStream(xmlBytes);
var document = XDocument.Load(stream);
It's that simple.
If starting out with a string, it should still be easy to do (assume xml is your string containing the XML with the byte order mark):
var bytes = Encoding.UTF8.GetBytes(xml);
var stream = new MemoryStream(bytes);
var document = XDocument.Load(stream);
I wrote the following post after coming across this issue.
Essentially instead of reading in the raw bytes of the file's contents using the BinaryReader class, I use the StreamReader class with a specific constructor which automatically removes the byte order mark character from the textual data I am trying to retrieve.
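The post itself is not reproduced here, so the following is only a minimal sketch of the idea, not the original code. It uses the StreamReader constructor that takes a stream, an encoding and the detectEncodingFromByteOrderMarks flag; the class name BomFreeReader and the method ReadText are made up for illustration.

```csharp
using System.IO;
using System.Text;

public static class BomFreeReader
{
    // Decode UTF-8 bytes to a string; StreamReader consumes a leading BOM
    // instead of returning it as the first character.
    public static string ReadText(byte[] data)
    {
        using (var stream = new MemoryStream(data))
        using (var reader = new StreamReader(stream, Encoding.UTF8, true))
        {
            return reader.ReadToEnd();
        }
    }
}
```

The same constructor works with a FileStream, so the file does not have to be loaded into a byte array first.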
It's of course best if you can strip it out while still on the byte array level to avoid unwanted substrings / allocs. But if you already have a string, this is perhaps the easiest and most performant way to handle this.
Usage:
string feed = ""; // input
bool hadBOM = FixBOMIfNeeded(ref feed);
var xElem = XElement.Parse(feed); // now does not fail
/// <summary>
/// You can get this or test it originally with: Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble())[0];
/// But no need, this way we have a constant. As these three bytes `[239, 187, 191]` (a BOM) evaluate to a single C# char.
/// </summary>
public const char BOMChar = (char)65279;
public static bool FixBOMIfNeeded(ref string str)
{
if (string.IsNullOrEmpty(str))
return false;
bool hasBom = str[0] == BOMChar;
if (hasBom)
str = str.Substring(1);
return hasBom;
}
Pass the byte buffer (via DownloadData) to string Encoding.UTF8.GetString(byte[]) to get the string rather than download the buffer as a string. You probably have more problems with your current method than just trimming the byte order mark. Unless you're properly decoding it as I suggest here, Unicode characters will probably be misinterpreted, resulting in a corrupted string.
Martin's answer is better, since it avoids allocating an entire string for XML that still needs to be parsed anyway. The answer I gave best applies to general strings that don't need to be parsed as XML.
I ran into this when I had a Base64 encoded file to transform into the string. While I could have saved it to a file and then read it correctly, here's the best solution I could think of to get from the byte[] of the file to the string (based lightly on TrueWill's answer):
public static string GetUTF8String(byte[] data)
{
byte[] utf8Preamble = Encoding.UTF8.GetPreamble();
if (data.StartsWith(utf8Preamble))
{
return Encoding.UTF8.GetString(data, utf8Preamble.Length, data.Length - utf8Preamble.Length);
}
else
{
return Encoding.UTF8.GetString(data);
}
}
Where StartsWith(byte[]) is the logical extension:
public static bool StartsWith(this byte[] thisArray, byte[] otherArray)
{
// Reject nulls and arrays too short to contain the prefix
if (thisArray == null || otherArray == null || thisArray.Length < otherArray.Length)
{
return false;
}
for (int i = 0; i < otherArray.Length; ++i)
{
if (thisArray[i] != otherArray[i])
{
return false;
}
}
return true;
}
StreamReader sr = new StreamReader(strFile, true);
XmlDocument xdoc = new XmlDocument();
xdoc.Load(sr);
Yet another generic variation to get rid of the UTF-8 BOM preamble:
var preamble = Encoding.UTF8.GetPreamble();
if (!functionBytes.Take(preamble.Length).SequenceEqual(preamble))
preamble = Array.Empty<Byte>();
return Encoding.UTF8.GetString(functionBytes, preamble.Length, functionBytes.Length - preamble.Length);
Use a regex replace to filter out any characters other than the alphanumeric characters and spaces that are contained in a normal certificate thumbprint value:
certficateThumbprint = Regex.Replace(certficateThumbprint, @"[^a-zA-Z0-9\-\s*]", "");
And there you go. Voila!! It worked for me.
I solved the issue with the following code
using System.IO;
using System.Xml.Linq;
void method()
{
byte[] bytes = GetXmlBytes();
XDocument doc;
using (var stream = new MemoryStream(bytes))
{
doc = XDocument.Load(stream);
}
}