Strip the byte order mark from string in C# - c#

In C#, I have a string that I'm obtaining from WebClient.DownloadString. I've tried setting client.Encoding to new UTF8Encoding(false), but that's made no difference - I still end up with a byte order mark for UTF-8 at the beginning of the result string. I need to remove this (to parse the resulting XML with LINQ), and want to do so in memory.
So I have a string that starts with \x00EF\x00BB\x00BF, and I want to remove that if it exists. Right now I'm using
if (xml.StartsWith(ByteOrderMarkUtf8))
{
xml = xml.Remove(0, ByteOrderMarkUtf8.Length);
}
but that just feels wrong. I've tried all sorts of code with streams, GetBytes, and encodings, and nothing works. Can anyone provide the "right" algorithm to strip a BOM from a string?

I recently had issues with the .NET 4 upgrade, but until then the simple answer is
String.Trim()
removes the BOM up until .NET 3.5.
However, in .NET 4 you need to change it slightly:
String.Trim(new char[]{'\uFEFF'});
That will also get rid of the byte order mark, though you may also want to remove the ZERO WIDTH SPACE (U+200B):
String.Trim(new char[]{'\uFEFF','\u200B'});
This you could also use to remove other unwanted characters.
Some further information is from
String.Trim Method:
The .NET Framework 3.5 SP1 and earlier versions maintain an internal list of white-space characters that this method trims. Starting with the .NET Framework 4, the method trims all Unicode white-space characters (that is, characters that produce a true return value when they are passed to the Char.IsWhiteSpace method). Because of this change, the Trim method in the .NET Framework 3.5 SP1 and earlier versions removes two characters, ZERO WIDTH SPACE (U+200B) and ZERO WIDTH NO-BREAK SPACE (U+FEFF), that the Trim method in the .NET Framework 4 and later versions does not remove. In addition, the Trim method in the .NET Framework 3.5 SP1 and earlier versions does not trim three Unicode white-space characters: MONGOLIAN VOWEL SEPARATOR (U+180E), NARROW NO-BREAK SPACE (U+202F), and MEDIUM MATHEMATICAL SPACE (U+205F).

I had some incorrect test data, which caused me some confusion. Based on How to avoid tripping over UTF-8 BOM when reading files I found that this worked:
private readonly string _byteOrderMarkUtf8 =
Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());
public string GetXmlResponse(Uri resource)
{
string xml;
using (var client = new WebClient())
{
client.Encoding = Encoding.UTF8;
xml = client.DownloadString(resource);
}
if (xml.StartsWith(_byteOrderMarkUtf8, StringComparison.Ordinal))
{
xml = xml.Remove(0, _byteOrderMarkUtf8.Length);
}
return xml;
}
Setting the client Encoding property correctly reduces the BOM to a single character. However, XDocument.Parse still will not read that string. This is the cleanest version I've come up with to date.

This works as well
int index = xmlResponse.IndexOf('<');
if (index > 0)
{
xmlResponse = xmlResponse.Substring(index, xmlResponse.Length - index);
}

A quick and simple method to remove it directly from a string:
private static string RemoveBom(string p)
{
string BOMMarkUtf8 = Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());
if (p.StartsWith(BOMMarkUtf8))
p = p.Remove(0, BOMMarkUtf8.Length);
return p.Replace("\0", "");
}
How to use it:
string yourCleanString=RemoveBom(yourBOMString);

If the variable xml is of type string, you did something wrong already - in a character string, the BOM should not be represented as three separate characters, but as a single code point.
Instead of using DownloadString, use DownloadData, and parse byte arrays instead. The XML parser should recognize the BOM itself, and skip it (except for auto-detecting the document encoding as UTF-8).

I had a very similar problem (I needed to parse an XML document represented as a byte array that had a byte order mark at the beginning of it). I used one of Martin's comments on his answer to come to a solution. I took the byte array I had (instead of converting it to a string) and created a MemoryStream object with it. Then I passed it to XDocument.Load, which worked like a charm. For example, let's say that xmlBytes contains your XML in UTF-8 encoding with a byte mark at the beginning of it. Then, this would be the code to solve the problem:
var stream = new MemoryStream(xmlBytes);
var document = XDocument.Load(stream);
It's that simple.
If starting out with a string, it should still be easy to do (assume xml is your string containing the XML with the byte order mark):
var bytes = Encoding.UTF8.GetBytes(xml);
var stream = new MemoryStream(bytes);
var document = XDocument.Load(stream);

I wrote the following post after coming across this issue.
Essentially instead of reading in the raw bytes of the file's contents using the BinaryReader class, I use the StreamReader class with a specific constructor which automatically removes the byte order mark character from the textual data I am trying to retrieve.

It's of course best if you can strip it out while still on the byte array level to avoid unwanted substrings / allocs. But if you already have a string, this is perhaps the easiest and most performant way to handle this.
Usage:
string feed = ""; // input
bool hadBOM = FixBOMIfNeeded(ref feed);
var xElem = XElement.Parse(feed); // now does not fail
/// <summary>
/// You can get this or test it originally with: Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble())[0];
/// But no need, this way we have a constant. As these three bytes `[239, 187, 191]` (a BOM) evaluate to a single C# char.
/// </summary>
public const char BOMChar = (char)65279;
public static bool FixBOMIfNeeded(ref string str)
{
if (string.IsNullOrEmpty(str))
return false;
bool hasBom = str[0] == BOMChar;
if (hasBom)
str = str.Substring(1);
return hasBom;
}

Pass the byte buffer (via DownloadData) to string Encoding.UTF8.GetString(byte[]) to get the string rather than download the buffer as a string. You probably have more problems with your current method than just trimming the byte order mark. Unless you're properly decoding it as I suggest here, Unicode characters will probably be misinterpreted, resulting in a corrupted string.
Martin's answer is better, since it avoids allocating an entire string for XML that still needs to be parsed anyway. The answer I gave best applies to general strings that don't need to be parsed as XML.

I ran into this when I had a Base64 encoded file to transform into the string. While I could have saved it to a file and then read it correctly, here's the best solution I could think of to get from the byte[] of the file to the string (based lightly on TrueWill's answer):
public static string GetUTF8String(byte[] data)
{
byte[] utf8Preamble = Encoding.UTF8.GetPreamble();
if (data.StartsWith(utf8Preamble))
{
return Encoding.UTF8.GetString(data, utf8Preamble.Length, data.Length - utf8Preamble.Length);
}
else
{
return Encoding.UTF8.GetString(data);
}
}
Where StartsWith(byte[]) is the logical extension:
public static bool StartsWith(this byte[] thisArray, byte[] otherArray)
{
// Handle invalid/unexpected input
// (nulls, thisArray.Length < otherArray.Length, etc.)
for (int i = 0; i < otherArray.Length; ++i)
{
if (thisArray[i] != otherArray[i])
{
return false;
}
}
return true;
}

StreamReader sr = new StreamReader(strFile, true);
XmlDocument xdoc = new XmlDocument();
xdoc.Load(sr);

Yet another generic variation to get rid of the UTF-8 BOM preamble:
var preamble = Encoding.UTF8.GetPreamble();
if (!functionBytes.Take(preamble.Length).SequenceEqual(preamble))
preamble = Array.Empty<Byte>();
return Encoding.UTF8.GetString(functionBytes, preamble.Length, functionBytes.Length - preamble.Length);

Use a regex replace to filter out any other characters other than the alphanumeric characters and spaces that are contained in a normal certificate thumbprint value:
certficateThumbprint = Regex.Replace(certficateThumbprint, #"[^a-zA-Z0-9\-\s*]", "");
And there you go. Voila!! It worked for me.

I solved the issue with the following code
using System.Xml.Linq;
void method()
{
byte[] bytes = GetXmlBytes();
XDocument doc;
using (var stream = new MemoryStream(docBytes))
{
doc = XDocument.Load(stream);
}
}

Related

Visual Basic to C#: load binary file in a string

I've to convert a project from old VB6 to c#, the aim is to preserve the old code as much possible as I can, for a matter of time.
A function of the old project loads a binary file into a string variable, and then this variable is analyzezed in its single characters values with the asc function:
OLD VB Code:
Public Function LoadText(ByVal DirIn As String) As String
Dim FileBuffer As String
Dim LenghtFile As Long
Dim ContIN As Long
ContIN = FreeFile
Open DirIn For Binary Access Read As #ContIN
LenghtFile = LOF(ContIN)
FileBuffer = Space(LenghtFile)
Get #ContIN, , FileBuffer
Close #ContIN
LoadText = FileBuffer
'following line for test purpose
debug.print(asc(mid(filebuffer,1,1)))
debug.print(asc(mid(filebuffer,2,1)))
debug.print(asc(mid(filebuffer,3,1)))
End Function
SUB Main
dim testSTring as String
teststring=loadtext("e:\testme.bin")
end sub
Result in immediate window:
1
10
133
C# code:
public static string LoadText(string dirIn)
{
string myString, myString2;
FileStream fs = new FileStream(dirIn, FileMode.Open);
BinaryReader br = new BinaryReader(fs);
byte[] bin = br.ReadBytes(Convert.ToInt32(fs.Length));
//myString = Convert.ToBase64String(bin);
myString = Encoding.Default.GetString(bin);
string m1 = Encoding.Default.GetString(bin);
//string m1 = Encoding.ASCII.GetString(bin);
//string m1 = Encoding.BigEndianUnicode.GetString(bin);
//string m1 = Encoding.UTF32.GetString(bin);
//string m1 = Encoding.UTF7.GetString(bin);
//string m1 = Encoding.UTF8.GetString(bin);
//string m1 = Encoding.Unicode.GetString(bin);
//string m1 = Encoding.Unicode.GetString(bin);
Console.WriteLine(General.Asc(m1.Substring(0, 1)));
Console.WriteLine(General.Asc(m1.Substring(1, 1)));
Console.WriteLine(General.Asc(m1.Substring(2, 1)));
br.Close();
fs.Close();
return myString;
}
General class:
public static int Asc(string stringToEValuate)
{
return (int)stringToEValuate[0];
}
Result in output window:
1
10
8230 <--fail!
The string in VB6 has a length 174848, identical to the size of the test file.
In c# is the same size for DEFAUILT and ASCII encoding, while all the others has different size and i cannot use them unless i change everithing in the whole project.
The problem is that I can't find the correct encoding that permits to have a string which asc function returns identical numbers to the VB6 one.
The problem is all there, if the string is not identical I have to change a lot of lines of code, because the whole program is based on ASCii value and the position of it in the string.
Maybe it's the wrong way to load a binary into a string, or the Asc function..
If you want to try the example file you can download it from here:
http:// www.snokie.org / testme.bin
8230 is correct. It is a UTF-16 code unit for the Unicode codepoint (U+2026, which only needs one UTF-16 code unit). You expected 133. 133 as one byte is the encoding for the same character in at least one other character set: Windows-1252.
There is no text but encoded text.
When you read a text file you have to know the encoding that was used to write it. Once you read into a .NET String or Char, you have it in Unicode's UTF-16 encoding. Because Unicode is a superset of any character set you would be using, it is not incorrect.
If you don't want to compare characters as characters, read them as binary to keep it in them in the same encoding as the file. You can then compare the byte sequences.
The problem is that the VB6 code, rather than using Unicode for character code like it should have, used the "default ANSI" character set, which changes meaning from system to system and user to user.
The problem is this: "old project loads a binary file into a string variable". Yes, this was a common—but bad—VB6 practice. String datatypes are for text. Strings in VB6 are UTF-16 code unit sequences, just like in .NET (and Java, JavaScript, HTML, XML, …).
Get #ContIN, , FileBuffer converts from the system's default ANSI code page to UTF-16 and Asc converts it back again. So, you just have to do that in your .NET code, too.
Note: Just like in the VB6, Encoding.Default is hazardous because it can vary from system to system and user to user.
Reference Microsoft.VisualBasic.dll and
using static Microsoft.VisualBasic.Strings;
Then
var fileBuffer = File.ReadAllText(path, Encoding.Default);
Debug.WriteLine(Asc(Mid(fileBuffer, 3, 1));
If you'd rather not bring Microsoft.VisualBasic.dll into a C# project, you can write your own versions
static class VB6StringReplacements
{
static public Byte Asc(String source) =>
Encoding.Default.GetBytes(source.Substring(0,1)).FirstOrDefault();
static public String Mid(String source, Int32 offset, Int32 length) =>
source.Substring(offset, length);
}
and, change your using directive to
using static VB6StringReplacements;

How can I reverse this text encoding?

We have the encoding method below that accepts an input string from a legacy system file (in the Unicode format of a VB6 string) and the name of an encoding. It applies the encoding and returns a string that displays correctly in our newer web applications. As our newer applications have a reporting backend that still depends on the old formats I have a need to reverse the encoding to allow newly translated strings to be stored in the legacy files. Here are two examples of conversions done by the Encode method.
Encode("µn¤J¦WºÙ", "BIG5") returns 登入名稱
Encode("Çàðåãèñòðèðîâàííîå èìÿ", "windows-1251") returns Зарегистрированное имя
To reverse these encodings I have been trying various encoding steps based on questions found here and elsewhere but have thus far succeeded only in producing output that appears identical to the input, is composed entirely of question marks or is composed of a mixture of ASCII characters and question marks different to the original input.
The encoding method was written by a departed colleague and I must admit that I don't fully understand why it has the coded loops, while all other examples I find simply get the characters from the string, then the bytes from those using the input encoding and finally get the string from the bytes using the output encoding. If I try removing the coded loops and just doing those three steps the method no longer returns the expected result.
Here is the encoding method, and my question is, how can I create a corresponding Decode method that reverses what it does?
private static string Encode(string src, string encoding)
{
if (String.IsNullOrWhiteSpace(encoding)) return src;
Encoding unicode = Encoding.Unicode;
Encoding sourceEncoding = Encoding.GetEncoding(encoding);
char[] srcChars = src.ToCharArray();
byte[] srcBytes = sourceEncoding.GetBytes(srcChars);
if (srcChars.Length == srcBytes.Length)
{
for (int i = 0; i < srcChars.Length; i++)
if ((int)srcChars[i] < 256)
srcBytes[i] = (byte)srcChars[i];
}
else
{
srcBytes = new byte[srcChars.Length];
for (int i = 0; i < srcChars.Length; i++)
srcBytes[i] = (byte)srcChars[i];
}
byte[] unicodeBytes = Encoding.Convert(sourceEncoding, unicode, srcBytes);
return unicode.GetString(unicodeBytes);
}

Convert special characters to normal

I need a way to convert special characters like this:
Helloæ
To normal characters. So this word would end up being Helloae. So far I have tried HttpUtility.Decode, or a method that would convert UTF8 to win1252, but nothing worked. Is there something simple and generic that would do this job?
Thank you.
EDIT
I have tried implementing those two methods using posts here on OC. Here's the methods:
public static string ConvertUTF8ToWin1252(string _source)
{
Encoding utf8 = new UTF8Encoding();
Encoding win1252 = Encoding.GetEncoding(1252);
byte[] input = _source.ToUTF8ByteArray();
byte[] output = Encoding.Convert(utf8, win1252, input);
return win1252.GetString(output);
}
// It should be noted that this method is expecting UTF-8 input only,
// so you probably should give it a more fitting name.
private static byte[] ToUTF8ByteArray(this string _str)
{
Encoding encoding = new UTF8Encoding();
return encoding.GetBytes(_str);
}
But it did not worked. The string remains the same way.
See: Does .NET transliteration library exists?
UnidecodeSharpFork
Usage:
var result = "Helloæ".Unidecode();
Console.WriteLine(result) // Prints Helloae
There is no direct mapping between æ and ae they are completely different unicode code points. If you need to do this you'll most likely need to write a function that maps the offending code points to the strings that you desire.
Per the comments you may need to take a two stage approach to this:
Remove the diacritics and combining characters per the link to the possible duplicate
Map any characters left that are not combining to alternate strings
switch(badChar){
case 'æ':
return "ae";
case 'ø':
return "oe";
// and so on
}

Encoding Conversion problem

I've got a little problem changing the ecoding of a string. Actually I read from a DB strings that are encoded using the codepage 850 and I have to prepare them in order to be suitable for an interoperable WCF service.
From the DB I read characters \x10 and \x11 (triangular shapes) and i want to convert them to the Unicode format in order to prevent serialization/deserialization problem during WCF call. (Chars
and are not valid according of the XML specs even if WCF serialize them).
Now, I use following code in order to covert string encoding, but nothing happens. Result string is in fact identical to the original one.
I'm probably missing something...
Please help me!!!
Emanuele
static class UnicodeEncodingExtension
{
public static string Convert(this Encoding sourceEncoding, Encoding targetEncoding, string value)
{
string reEncodedString = null;
byte[] sourceBytes = sourceEncoding.GetBytes(value);
byte[] targetBytes = Encoding.Convert(sourceEncoding, targetEncoding, sourceBytes);
reEncodedString = sourceEncoding.GetString(targetBytes);
return reEncodedString;
}
}
class Program
{
private static Encoding Cp850Encoding = Encoding.GetEncoding(850);
private static Encoding UnicodeEncoding = Encoding.UTF8;
static void Main(string[] args)
{
string value;
string resultValue;
value = "\x10";
resultValue = Cp850Encoding.Convert(UnicodeEncoding, value);
value = "\x11";
resultValue = Cp850Encoding.Convert(UnicodeEncoding, value);
value = "\u25b6";
resultValue = UnicodeEncoding.Convert(Cp850Encoding, value);
value = "\u25c0";
resultValue = UnicodeEncoding.Convert(Cp850Encoding, value);
}
}
It seems you think there is a problem based on an incorrect understanding. But jmservera is correct - all strings in .NET are encoded internally as unicode.
You didn't say exactly what you want to accomplish. Are you experiencing a problem at the other end of the wire?
Just FYI, you can set the text encoding on a WCF binding with the textMessageEncoding element in the config file.
I suspect this line may be your culprit
reEncodedString = sourceEncoding.GetString(targetBytes);
which seems to take your target encoded string of bytes and asks your sourceEncoding to make a string out of them. I've not had a chance to verify it but I suspect the following might be better
reEncodedString = targetEncoding.GetString(targetBytes);
All the strings stored in string are in fact Unicode.Unicode. Read: Strings in .Net and C# and The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Edit: I suppose that you want the Convert function to automatically change \x11 to \u25c0, but the problem here is that \x11 is valid in almost any encoding, the differences usually start in character \x80, so the Convert function will maintain it even if you do that:
string reEncodedString = null;
byte[] unicodeBytes = UnicodeEncoding.Unicode.GetBytes(value);
byte[] sourceBytes = Encoding.Convert(Encoding.Unicode,
sourceEncoding, unicodeBytes);
You can see in unicode.org the mappings from CP850 to Unicode. So, for this conversion to happen you will have to change these characters manually.
byte[] sourceBytes =Encoding.Default.GetBytes(value)
Encoding.UTF8.GetString(sourceBytes)
this sequence usefull for download unicode file from service(for example xml file that contain persian character)
You should try this:
byte[] sourceBytes = sourceEncoding.GetBytes(value);
var convertedString = Encoding.UTF8.GetString(sourceBytes);

How to convert (transliterate) a string from utf8 to ASCII (single byte) in c#?

I have a string object
"with multiple characters and even special characters"
I am trying to use
UTF8Encoding utf8 = new UTF8Encoding();
ASCIIEncoding ascii = new ASCIIEncoding();
objects in order to convert that string to ascii. May I ask someone to bring some light to this simple task, that is hunting my afternoon.
EDIT 1:
What we are trying to accomplish is getting rid of special characters like some of the special windows apostrophes. The code that I posted below as an answer will not take care of that. Basically
O'Brian will become O?Brian. where ' is one of the special apostrophes
This was in response to your other question, that looks like it's been deleted....the point still stands.
Looks like a classic Unicode to ASCII issue. The trick would be to find where it's happening.
.NET works fine with Unicode, assuming it's told it's Unicode to begin with (or left at the default).
My guess is that your receiving app can't handle it. So, I'd probably use the ASCIIEncoder with an EncoderReplacementFallback with String.Empty:
using System.Text;
string inputString = GetInput();
var encoder = ASCIIEncoding.GetEncoder();
encoder.Fallback = new EncoderReplacementFallback(string.Empty);
byte[] bAsciiString = encoder.GetBytes(inputString);
// Do something with bytes...
// can write to a file as is
File.WriteAllBytes(FILE_NAME, bAsciiString);
// or turn back into a "clean" string
string cleanString = ASCIIEncoding.GetString(bAsciiString);
// since the offending bytes have been removed, can use default encoding as well
Assert.AreEqual(cleanString, Default.GetString(bAsciiString));
Of course, in the old days, we'd just loop though and remove any chars greater than 127...well, those of us in the US at least. ;)
I was able to figure it out. In case someone wants to know below the code that worked for me:
ASCIIEncoding ascii = new ASCIIEncoding();
byte[] byteArray = Encoding.UTF8.GetBytes(sOriginal);
byte[] asciiArray = Encoding.Convert(Encoding.UTF8, Encoding.ASCII, byteArray);
string finalString = ascii.GetString(asciiArray);
Let me know if there is a simpler way o doing it.
For anyone who likes Extension methods, this one does the trick for us.
using System.Text;
namespace System
{
public static class StringExtension
{
private static readonly ASCIIEncoding asciiEncoding = new ASCIIEncoding();
public static string ToAscii(this string dirty)
{
byte[] bytes = asciiEncoding.GetBytes(dirty);
string clean = asciiEncoding.GetString(bytes);
return clean;
}
}
}
(System namespace so it's available pretty much automatically for all of our strings.)
Based on Mark's answer above (and Geo's comment), I created a two liner version to remove all ASCII exception cases from a string. Provided for people searching for this answer (as I did).
using System.Text;
// Create encoder with a replacing encoder fallback
var encoder = ASCIIEncoding.GetEncoding("us-ascii",
new EncoderReplacementFallback(string.Empty),
new DecoderExceptionFallback());
string cleanString = encoder.GetString(encoder.GetBytes(dirtyString));
If you want 8 bit representation of characters that used in many encoding, this may help you.
You must change variable targetEncoding to whatever encoding you want.
Encoding targetEncoding = Encoding.GetEncoding(874); // Your target encoding
Encoding utf8 = Encoding.UTF8;
var stringBytes = utf8.GetBytes(Name);
var stringTargetBytes = Encoding.Convert(utf8, targetEncoding, stringBytes);
var ascii8BitRepresentAsCsString = Encoding.GetEncoding("Latin1").GetString(stringTargetBytes);

Categories