I'm currently searching for an easy way to serialize objects (in C# 3).
I googled some examples and came up with something like:
MemoryStream memoryStream = new MemoryStream();
XmlSerializer xs = new XmlSerializer(typeof(MyObject));
XmlTextWriter xmlTextWriter = new XmlTextWriter(memoryStream, Encoding.UTF8);
xs.Serialize(xmlTextWriter, myObject);
string result = Encoding.UTF8.GetString(memoryStream.ToArray());
After reading this question I asked myself: why not use StringWriter? It seems much easier.
XmlSerializer ser = new XmlSerializer(typeof(MyObject));
StringWriter writer = new StringWriter();
ser.Serialize(writer, myObject);
serializedValue = writer.ToString();
Another problem was that the first example generated XML I could not simply write into an XML column of a SQL Server 2005 database.
The first question is: is there a reason why I shouldn't use StringWriter to serialize an object when I need it as a string afterwards? I never came across StringWriter in the results when googling.
The second is, of course: if you should not do it with StringWriter (for whatever reason), what would be a good and correct way?
Addition:
As both answers already mentioned, I'll go further into the XML-to-database problem.
When writing to the Database I got the following exception:
System.Data.SqlClient.SqlException:
XML parsing: line 1, character 38,
unable to switch the encoding
For string
<?xml version="1.0" encoding="utf-8"?><test/>
I took the string created by the XmlTextWriter and used it as the XML to insert. This did not work (neither did manual insertion into the DB).
Afterwards I tried manual insertion (just writing INSERT INTO ...) with encoding="utf-16", which also failed.
Removing the encoding declaration entirely worked. After that result I switched back to the StringWriter code and voilà - it worked.
Problem: I don't really understand why.
@Christian Hayter: With those tests I'm not sure that I have to use UTF-16 to write to the DB. Wouldn't setting the encoding to UTF-16 (in the XML declaration) work then?
One problem with StringWriter is that by default it doesn't let you set the encoding which it advertises - so you can end up with an XML document advertising its encoding as UTF-16, which means you need to encode it as UTF-16 if you write it to a file. I have a small class to help with that though:
public sealed class StringWriterWithEncoding : StringWriter
{
    public override Encoding Encoding { get; }

    public StringWriterWithEncoding(Encoding encoding)
    {
        Encoding = encoding;
    }
}
Or if you only need UTF-8 (which is all I often need):
public sealed class Utf8StringWriter : StringWriter
{
    public override Encoding Encoding => Encoding.UTF8;
}
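A minimal usage sketch, reusing the MyObject / myObject placeholders from the question:

XmlSerializer serializer = new XmlSerializer(typeof(MyObject));
using (StringWriter writer = new Utf8StringWriter())
{
    serializer.Serialize(writer, myObject);
    string xml = writer.ToString(); // the declaration now reads encoding="utf-8"
}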
As for why you couldn't save your XML to the database - you'll have to give us more details about what happened when you tried, if you want us to be able to diagnose/fix it.
When serialising an XML document to a .NET string, the encoding must be set to UTF-16. Strings are stored as UTF-16 internally, so this is the only encoding that makes sense. If you want to store data in a different encoding, you use a byte array instead.
SQL Server works on a similar principle; any string passed into an xml column must be encoded as UTF-16. SQL Server will reject any string where the XML declaration does not specify UTF-16. If the XML declaration is not present, then the XML standard requires that it default to UTF-8, so SQL Server will reject that as well.
Bearing this in mind, here are some utility methods for doing the conversion.
public static string Serialize<T>(T value)
{
    if (value == null)
    {
        return null;
    }

    XmlSerializer serializer = new XmlSerializer(typeof(T));
    XmlWriterSettings settings = new XmlWriterSettings()
    {
        Encoding = new UnicodeEncoding(false, false), // no BOM in a .NET string
        Indent = false,
        OmitXmlDeclaration = false
    };

    using (StringWriter textWriter = new StringWriter())
    {
        using (XmlWriter xmlWriter = XmlWriter.Create(textWriter, settings))
        {
            serializer.Serialize(xmlWriter, value);
        }
        return textWriter.ToString();
    }
}
public static T Deserialize<T>(string xml)
{
    if (string.IsNullOrEmpty(xml))
    {
        return default(T);
    }

    XmlSerializer serializer = new XmlSerializer(typeof(T));
    XmlReaderSettings settings = new XmlReaderSettings();
    // No settings need modifying here

    using (StringReader textReader = new StringReader(xml))
    {
        using (XmlReader xmlReader = XmlReader.Create(textReader, settings))
        {
            return (T)serializer.Deserialize(xmlReader);
        }
    }
}
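A quick round-trip sketch using the two helpers above (MyObject stands in for whatever type is being serialized):

MyObject original = new MyObject();
string xml = Serialize(original);            // declaration will say encoding="utf-16"
MyObject copy = Deserialize<MyObject>(xml);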
First of all, beware of finding old examples. You've found one that uses XmlTextWriter, which is deprecated as of .NET 2.0. XmlWriter.Create should be used instead.
Here's an example of serializing an object into an XML column:
public void SerializeToXmlColumn(object obj)
{
    using (var outputStream = new MemoryStream())
    {
        using (var writer = XmlWriter.Create(outputStream))
        {
            var serializer = new XmlSerializer(obj.GetType());
            serializer.Serialize(writer, obj);
        }

        outputStream.Position = 0;
        using (var conn = new SqlConnection(Settings.Default.ConnectionString))
        {
            conn.Open();
            const string INSERT_COMMAND = @"INSERT INTO XmlStore (Data) VALUES (@Data)";
            using (var cmd = new SqlCommand(INSERT_COMMAND, conn))
            {
                using (var reader = XmlReader.Create(outputStream))
                {
                    var xml = new SqlXml(reader);

                    cmd.Parameters.Clear();
                    cmd.Parameters.AddWithValue("@Data", xml);
                    cmd.ExecuteNonQuery();
                }
            }
        }
    }
}
<TL;DR> The problem is rather simple, actually: you are not matching the declared encoding (in the XML declaration) with the datatype of the input parameter. If you manually added <?xml version="1.0" encoding="utf-8"?><test/> to the string, then declaring the SqlParameter to be of type SqlDbType.Xml or SqlDbType.NVarChar would give you the "unable to switch the encoding" error. Then, when inserting manually via T-SQL, since you switched the declared encoding to be utf-16, you were clearly inserting a VARCHAR string (not prefixed with an upper-case "N", hence an 8-bit encoding, such as UTF-8) and not an NVARCHAR string (prefixed with an upper-case "N", hence the 16-bit UTF-16 LE encoding).
The fix should have been as simple as:
In the first case, when adding the declaration stating encoding="utf-8": simply don't add the XML declaration.
In the second case, when adding the declaration stating encoding="utf-16": either
simply don't add the XML declaration, OR
simply add an "N" to the input parameter type: SqlDbType.NVarChar instead of SqlDbType.VarChar :-) (or possibly even switch to using SqlDbType.Xml)
(Detailed response is below)
All of the answers here are over-complicated and unnecessary (regardless of the 121 and 184 up-votes for Christian's and Jon's answers, respectively). They might provide working code, but none of them actually answer the question. The issue is that nobody truly understood the question, which ultimately is about how the XML datatype in SQL Server works. Nothing against those two clearly intelligent people, but this question has little to nothing to do with serializing to XML. Saving XML data into SQL Server is much easier than what is being implied here.
It doesn't really matter how the XML is produced as long as you follow the rules of how to create XML data in SQL Server. I have a more thorough explanation (including working example code to illustrate the points outlined below) in an answer on this question: How to solve “unable to switch the encoding” error when inserting XML into SQL Server, but the basics are:
The XML declaration is optional
The XML datatype always stores strings as UCS-2 / UTF-16 LE
If your XML is UCS-2 / UTF-16 LE, then you:
pass in the data as either NVARCHAR(MAX) or XML / SqlDbType.NVarChar (maxsize = -1) or SqlDbType.Xml, or if using a string literal then it must be prefixed with an upper-case "N".
if specifying the XML declaration, it must be either "UCS-2" or "UTF-16" (no real difference here)
If your XML is 8-bit encoded (e.g. "UTF-8" / "iso-8859-1" / "Windows-1252"), then you:
need to specify the XML declaration IF the encoding is different than the code page specified by the default Collation of the database
you must pass in the data as VARCHAR(MAX) / SqlDbType.VarChar (maxsize = -1), or if using a string literal then it must not be prefixed with an upper-case "N".
Whatever 8-bit encoding is used, the "encoding" noted in the XML declaration must match the actual encoding of the bytes.
The 8-bit encoding will be converted into UTF-16 LE by the XML datatype
With the points outlined above in mind, and given that strings in .NET are always UTF-16 LE / UCS-2 LE (there is no difference between those in terms of encoding), we can answer your questions:
Is there a reason why I shouldn't use StringWriter to serialize an Object when I need it as a string afterwards?
No, your StringWriter code appears to be just fine (at least I see no issues in my limited testing using the 2nd code block from the question).
Wouldn't setting the encoding to UTF-16 (in the xml tag) work then?
It isn't necessary to provide the XML declaration. When it is missing, the encoding is assumed to be UTF-16 LE if you pass the string into SQL Server as NVARCHAR (i.e. SqlDbType.NVarChar) or XML (i.e. SqlDbType.Xml). The encoding is assumed to be the default 8-bit Code Page if passing in as VARCHAR (i.e. SqlDbType.VarChar). If you have any non-standard-ASCII characters (i.e. values 128 and above) and are passing in as VARCHAR, then you will likely see "?" for BMP characters and "??" for Supplementary Characters as SQL Server will convert the UTF-16 string from .NET into an 8-bit string of the current Database's Code Page before converting it back into UTF-16 / UCS-2. But you shouldn't get any errors.
On the other hand, if you do specify the XML declaration, then you must pass into SQL Server using the matching 8-bit or 16-bit datatype. So if you have a declaration stating that the encoding is either UCS-2 or UTF-16, then you must pass in as SqlDbType.NVarChar or SqlDbType.Xml. Or, if you have a declaration stating that the encoding is one of the 8-bit options (i.e. UTF-8, Windows-1252, iso-8859-1, etc), then you must pass in as SqlDbType.VarChar. Failure to match the declared encoding with the proper 8 or 16 -bit SQL Server datatype will result in the "unable to switch the encoding" error that you were getting.
For example, using your StringWriter-based serialization code, I simply printed the resulting string of the XML and used it in SSMS. As you can see below, the XML declaration is included (because StringWriter does not have an option to OmitXmlDeclaration like XmlWriter does), which poses no problem so long as you pass the string in as the correct SQL Server datatype:
-- Upper-case "N" prefix == NVARCHAR, hence no error:
DECLARE @Xml XML = N'<?xml version="1.0" encoding="utf-16"?>
<string>Test ሴ😸</string>';
SELECT @Xml;
-- <string>Test ሴ😸</string>
As you can see, it even handles characters beyond standard ASCII, given that ሴ is BMP Code Point U+1234, and 😸 is Supplementary Character Code Point U+1F638. However, the following:
-- No upper-case "N" prefix on the string literal, hence VARCHAR:
DECLARE @Xml XML = '<?xml version="1.0" encoding="utf-16"?>
<string>Test ሴ😸</string>';
results in the following error:
Msg 9402, Level 16, State 1, Line XXXXX
XML parsing: line 1, character 39, unable to switch the encoding
Ergo, all of that explanation aside, the full solution to your original question is:
You were clearly passing the string in as SqlDbType.VarChar. Switch to SqlDbType.NVarChar and it will work without needing to go through the extra step of removing the XML declaration. This is preferred over keeping SqlDbType.VarChar and removing the XML declaration because this solution will prevent data loss when the XML includes non-standard-ASCII characters. For example:
-- No upper-case "N" prefix on the string literal == VARCHAR, and no XML declaration:
DECLARE @Xml2 XML = '<string>Test ሴ😸</string>';
SELECT @Xml2;
-- <string>Test ???</string>
As you can see, there is no error this time, but now there is data-loss 🙀.
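For completeness, here is a minimal ADO.NET sketch of that fix. The table name XmlStore, the column Data and the connectionString variable are only illustrative, and serializedValue is the string produced by the question's StringWriter code:

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("INSERT INTO XmlStore (Data) VALUES (@Data)", conn))
{
    // NVarChar (or SqlDbType.Xml) is a 16-bit type, so it matches the
    // encoding="utf-16" declaration and no encoding switch is needed.
    cmd.Parameters.Add("@Data", SqlDbType.NVarChar, -1).Value = serializedValue;

    conn.Open();
    cmd.ExecuteNonQuery();
}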
public static T DeserializeFromXml<T>(string xml)
{
    T result;
    XmlSerializerFactory serializerFactory = new XmlSerializerFactory();
    XmlSerializer serializer = serializerFactory.CreateSerializer(typeof(T));

    using (StringReader sr3 = new StringReader(xml))
    {
        XmlReaderSettings settings = new XmlReaderSettings()
        {
            CheckCharacters = false // default value is true
        };
        using (XmlReader xr3 = XmlReader.Create(sr3, settings))
        {
            result = (T)serializer.Deserialize(xr3);
        }
    }
    return result;
}
For anyone in need of an F# version of the accepted answer:
type private Utf8StringWriter() =
    inherit StringWriter()
    override _.Encoding = System.Text.Encoding.UTF8
It may have been covered elsewhere, but simply changing the encoding line of the XML source to 'utf-16' allows the XML to be inserted into a SQL Server 'xml' data type.
using (DataSetTableAdapters.SQSTableAdapter tbl_SQS = new DataSetTableAdapters.SQSTableAdapter())
{
    try
    {
        bodyXML = @"<?xml version=""1.0"" encoding=""UTF-8"" standalone=""yes""?><test></test>";
        bodyXMLutf16 = bodyXML.Replace("UTF-8", "UTF-16");
        tbl_SQS.Insert(messageID, receiptHandle, md5OfBody, bodyXMLutf16, sourceType);
    }
    catch (System.Data.SqlClient.SqlException ex)
    {
        Console.WriteLine(ex.Message);
        Console.ReadLine();
    }
}
The result is that all of the XML text is inserted into the 'xml' data type field, but the 'header' line is removed. What you see in the resulting record is just
<test></test>
Using the serialization method described in the "Answered" entry is a way of including the original header in the target field, but then the remaining XML text is enclosed in an XML <string></string> tag.
The table adapter in the code is a class automatically built using the Visual Studio 2013 "Add New Data Source" wizard. The five parameters of the Insert method map to fields in a SQL Server table.
Related
I am trying to open an XML file in a browser and I get the following error:
Switch from current encoding to specified encoding not supported. Error processing resource
Now I am generating this XML using the C# code below. My XML has Latin characters, so I want the encoding to be ISO-8859-1 and not utf-16. But every time my XML is generated, it takes the encoding utf-16 and not ISO-8859-1. I would like to know the cause and the resolution.
Note: I only need ISO-8859-1 and not utf-8 or utf-16, as the XML is generated properly when the encoding is ISO-8859-1.
C# code:
var doc = new XDocument(
    new XDeclaration("1.0", "iso-8859-1", "yes"),
    new XElement("webapp",
        new XAttribute("name", webApp.Name),
        new XAttribute("id", webApp.Id),
        Myinformation()));
var wr = new StringWriter();
doc.Save(wr);
return wr.ToString();
A string, conceptually at least, doesn't have an encoding. It's just a sequence of unicode characters. That said, a string in memory is essentially UTF-16, hence StringWriter returns that from its Encoding property.
What you need is a writer that specifies the encoding you want:
var doc = new XDocument(new XElement("root"));

byte[] bytes;
var isoEncoding = Encoding.GetEncoding("ISO-8859-1");
var settings = new XmlWriterSettings
{
    Indent = true,
    Encoding = isoEncoding
};

using (var ms = new MemoryStream())
{
    using (var writer = XmlWriter.Create(ms, settings))
    {
        doc.Save(writer);
    }
    bytes = ms.ToArray();
}
This will give you the binary data encoded using ISO-8859-1, and as a consequence the declaration will specify that encoding.
You can convert this back to a string:
var text = isoEncoding.GetString(bytes);
But note that this is no longer encoded in ISO-8859-1, it's just a string again. You could very easily encode it using something else and your declaration would be incorrect.
If you're sure you want this as an intermediate string, then I'd suggest you use the approach in this answer.
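That is, something along these lines, reusing the StringWriterWithEncoding helper shown earlier in this thread:

var doc = new XDocument(new XElement("root"));

var isoEncoding = Encoding.GetEncoding("ISO-8859-1");
using (var writer = new StringWriterWithEncoding(isoEncoding))
{
    doc.Save(writer);
    // The declaration now advertises iso-8859-1, although the string itself
    // is, as always, UTF-16 in memory.
    string text = writer.ToString();
}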
I have to convert a project from old VB6 to C#; the aim is to preserve the old code as much as possible, for reasons of time.
A function in the old project loads a binary file into a string variable, and this variable is then analyzed character by character with the Asc function:
OLD VB Code:
Public Function LoadText(ByVal DirIn As String) As String
    Dim FileBuffer As String
    Dim LenghtFile As Long
    Dim ContIN As Long

    ContIN = FreeFile
    Open DirIn For Binary Access Read As #ContIN
    LenghtFile = LOF(ContIN)
    FileBuffer = Space(LenghtFile)
    Get #ContIN, , FileBuffer
    Close #ContIN
    LoadText = FileBuffer

    'following lines for test purpose
    Debug.Print (Asc(Mid(FileBuffer, 1, 1)))
    Debug.Print (Asc(Mid(FileBuffer, 2, 1)))
    Debug.Print (Asc(Mid(FileBuffer, 3, 1)))
End Function

Sub Main()
    Dim testString As String
    testString = LoadText("e:\testme.bin")
End Sub
Result in immediate window:
1
10
133
C# code:
public static string LoadText(string dirIn)
{
    string myString, myString2;
    FileStream fs = new FileStream(dirIn, FileMode.Open);
    BinaryReader br = new BinaryReader(fs);
    byte[] bin = br.ReadBytes(Convert.ToInt32(fs.Length));

    //myString = Convert.ToBase64String(bin);
    myString = Encoding.Default.GetString(bin);
    string m1 = Encoding.Default.GetString(bin);
    //string m1 = Encoding.ASCII.GetString(bin);
    //string m1 = Encoding.BigEndianUnicode.GetString(bin);
    //string m1 = Encoding.UTF32.GetString(bin);
    //string m1 = Encoding.UTF7.GetString(bin);
    //string m1 = Encoding.UTF8.GetString(bin);
    //string m1 = Encoding.Unicode.GetString(bin);

    Console.WriteLine(General.Asc(m1.Substring(0, 1)));
    Console.WriteLine(General.Asc(m1.Substring(1, 1)));
    Console.WriteLine(General.Asc(m1.Substring(2, 1)));

    br.Close();
    fs.Close();
    return myString;
}
General class:
public static int Asc(string stringToEValuate)
{
    return (int)stringToEValuate[0];
}
Result in output window:
1
10
8230 <--fail!
The string in VB6 has a length of 174848, identical to the size of the test file.
In C# it is the same size for the Default and ASCII encodings, while all the others give a different size, and I cannot use them unless I change everything in the whole project.
The problem is that I can't find the correct encoding that gives me a string for which the Asc function returns the same numbers as the VB6 version.
That is the whole problem: if the string is not identical, I have to change a lot of lines of code, because the whole program is based on ASCII values and their positions in the string.
Maybe loading a binary file into a string is the wrong way, or maybe it's the Asc function...
If you want to try the example file you can download it from here:
http://www.snokie.org/testme.bin
8230 is correct: it is the UTF-16 code unit for the Unicode code point U+2026 (which needs only one UTF-16 code unit). You expected 133; 133, as a single byte, is the encoding for the same character in at least one other character set: Windows-1252.
There is no text but encoded text.
When you read a text file you have to know the encoding that was used to write it. Once you read into a .NET String or Char, you have it in Unicode's UTF-16 encoding. Because Unicode is a superset of any character set you would be using, it is not incorrect.
If you don't want to compare characters as characters, read the file as binary to keep the data in the same encoding as the file. You can then compare the byte sequences.
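For instance, a byte-level sketch against the test file from the question (the expected values are the ones the question's VB6 code printed):

// Read the file as raw bytes; no text decoding is involved.
byte[] data = File.ReadAllBytes(@"e:\testme.bin");

Console.WriteLine(data[0]); // 1
Console.WriteLine(data[1]); // 10
Console.WriteLine(data[2]); // 133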
The problem is that the VB6 code, rather than using Unicode for character code like it should have, used the "default ANSI" character set, which changes meaning from system to system and user to user.
The problem is this: "old project loads a binary file into a string variable". Yes, this was a common—but bad—VB6 practice. String datatypes are for text. Strings in VB6 are UTF-16 code unit sequences, just like in .NET (and Java, JavaScript, HTML, XML, …).
Get #ContIN, , FileBuffer converts from the system's default ANSI code page to UTF-16 and Asc converts it back again. So, you just have to do that in your .NET code, too.
Note: Just like in the VB6, Encoding.Default is hazardous because it can vary from system to system and user to user.
Reference Microsoft.VisualBasic.dll and
using static Microsoft.VisualBasic.Strings;
Then
var fileBuffer = File.ReadAllText(path, Encoding.Default);
Debug.WriteLine(Asc(Mid(fileBuffer, 3, 1)));
If you'd rather not bring Microsoft.VisualBasic.dll into a C# project, you can write your own versions
// Requires: using System; using System.Linq; using System.Text;
static class VB6StringReplacements
{
    static public Byte Asc(String source) =>
        Encoding.Default.GetBytes(source.Substring(0, 1)).FirstOrDefault();

    // VB6's Mid is 1-based, so shift the offset before calling Substring.
    static public String Mid(String source, Int32 offset, Int32 length) =>
        source.Substring(offset - 1, length);
}
and, change your using directive to
using static VB6StringReplacements;
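After that, the earlier test lines can stay almost exactly as they were (path is a placeholder, and the expected values assume the same Windows-1252 default code page as the VB6 run):

var fileBuffer = File.ReadAllText(path, Encoding.Default);

Debug.WriteLine(Asc(Mid(fileBuffer, 1, 1))); // 1
Debug.WriteLine(Asc(Mid(fileBuffer, 2, 1))); // 10
Debug.WriteLine(Asc(Mid(fileBuffer, 3, 1))); // 133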
I want to serialize a .NET object to JSON which contains foreign-language strings such as Chinese or Russian. When I do that (using the code below), those strings come out in the resulting JSON as "?" characters instead of the requisite Unicode characters.
using Newtonsoft.Json;
var serialized = JsonConvert.SerializeObject(myObj, new JsonSerializerSettings { TypeNameHandling = TypeNameHandling.All, Formatting = Newtonsoft.Json.Formatting.Indented });
Is there a way to use the JSON.Net serializer with foreign languages?
E.g.
אספירין (hebrew)
एस्पिरि (hindi)
阿司匹林 (chinese)
アセチルサリチル酸 (japanese)
Many Thanks!
It is not the serializer that is causing this issue; Json.Net handles foreign characters just fine. More likely you are doing one of the following:
Using an inappropriate encoding (or not setting the encoding) when writing the JSON to a file or stream. You should probably be using Encoding.UTF8.
Storing the JSON into a varchar column in your database rather than nvarchar. varchar does not support unicode characters.
Viewing the JSON with a viewer that does not support unicode, uses the wrong encoding and/or uses a font that does not have the full set of unicode character glyphs. The Windows command prompt window seems to have this issue, for example.
To prove that the serializer is not the problem, try compiling and running the following example program. It will create two different output files from the same JSON, one using UTF-8 encoding and the other using the default encoding. Open each file using Notepad. The "default" file will have the foreign characters as ? characters. In the UTF-8 encoded file, you should see all the characters are intact. (If you still don't see them, try changing the Notepad font to "Arial Unicode MS".)
You can also see the foreign characters are correct in the JSON using the Visual Studio debugger; just put a breakpoint after the line where it serializes the JSON and examine the json variable.
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using Newtonsoft.Json;

class Program
{
    static void Main(string[] args)
    {
        List<Foo> foos = new List<Foo>
        {
            new Foo { Language = "Hebrew", Sample = "אספירין" },
            new Foo { Language = "Hindi", Sample = "एस्पिरि" },
            new Foo { Language = "Chinese", Sample = "阿司匹林" },
            new Foo { Language = "Japanese", Sample = "アセチルサリチル酸" },
        };

        var json = JsonConvert.SerializeObject(foos, Formatting.Indented);

        // Same JSON, two different encodings on disk.
        File.WriteAllText("utf8.json", json, Encoding.UTF8);
        File.WriteAllText("default.json", json, Encoding.Default);
    }
}

class Foo
{
    public string Language { get; set; }
    public string Sample { get; set; }
}
I have been working with Arabic text and I found the solution here (note that this uses System.Text.Json rather than Json.NET), in the section "Serialize all characters":
var options = new JsonSerializerOptions
{
    Encoder = JavaScriptEncoder.UnsafeRelaxedJsonEscaping,
    WriteIndented = true
};

var jsonString = JsonSerializer.Serialize(weatherForecast, options);
I have an application which serializes and deserializes .NET objects to XML. While deserializing I am getting the following error:
"There is an error in XML
Document(1,2) Name cannot begin with
the '.' character, hexadecimal value
0x00. Line 1, position 2. "
The code snippet that does the deserializing is:
string xmlEntity = _loanReader["LoanEntity"].ToString();
XmlSerializer xs2 = new XmlSerializer(typeof(Model.Loan));
MemoryStream memoryStream2 = new MemoryStream(StringFunction.StringToUTF16ByteArray(xmlEntity));
XmlTextWriter xmlTextWriter2 = new XmlTextWriter(memoryStream2, Encoding.Unicode);
_loan = (Model.Loan)xs2.Deserialize(memoryStream2);
I am using a DataReader to get the result set from the stored procedure. LoanEntity is an XML type field in the loan table.
A snippet of the XML stored in the field:
<Loan xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<GUID>d2cc9dc3-45b0-44bd-b9d2-6ef5e7ddb54c</GUID><LoanNumber>DEV999999</LoanNumber>
....
I have spent countless hours trying to figure out what the error means but to no avail. Any help will be appreciated.
This is usually an issue with encoding. I see you have the string being converted to a UTF-16 byte array. Have you checked whether it should be UTF-8 instead? I would give that a go and see what comes of it. Basically, the deserializer might be looking for a different encoding.
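A sketch of that suggestion, using Encoding.UTF8 directly rather than the question's StringFunction helper (whether it resolves the error depends on how the stored XML was actually produced):

string xmlEntity = _loanReader["LoanEntity"].ToString();
XmlSerializer xs2 = new XmlSerializer(typeof(Model.Loan));
using (var memoryStream2 = new MemoryStream(Encoding.UTF8.GetBytes(xmlEntity)))
{
    _loan = (Model.Loan)xs2.Deserialize(memoryStream2);
}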
You must be working from an old example, and a bad one. Try this:
string xmlEntity = _loanReader["LoanEntity"].ToString();
XmlSerializer xs2 = new XmlSerializer(typeof(Model.Loan));
using (MemoryStream memoryStream2 = new MemoryStream(StringFunction.StringToUTF16ByteArray(xmlEntity)))
{
    // Deserialization needs a reader over the stream, not a writer.
    using (XmlReader reader = XmlReader.Create(memoryStream2))
    {
        _loan = (Model.Loan)xs2.Deserialize(reader);
    }
}
I believe I may have found a solution to this. Since the SQL Server XML field expects Unicode-encoded values, I tried using a StringReader instead of a MemoryStream, and things work well so far. The following Stack Overflow post helped as well:
Using StringWriter for XML Serialization
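In code, that StringReader-based version looks roughly like this (same Model.Loan and _loanReader as above):

string xmlEntity = _loanReader["LoanEntity"].ToString();
XmlSerializer xs2 = new XmlSerializer(typeof(Model.Loan));
using (var stringReader = new StringReader(xmlEntity))
{
    _loan = (Model.Loan)xs2.Deserialize(stringReader);
}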
In C#, I have a string that I'm obtaining from WebClient.DownloadString. I've tried setting client.Encoding to new UTF8Encoding(false), but that's made no difference - I still end up with a byte order mark for UTF-8 at the beginning of the result string. I need to remove this (to parse the resulting XML with LINQ), and want to do so in memory.
So I have a string that starts with \x00EF\x00BB\x00BF, and I want to remove that if it exists. Right now I'm using
if (xml.StartsWith(ByteOrderMarkUtf8))
{
xml = xml.Remove(0, ByteOrderMarkUtf8.Length);
}
but that just feels wrong. I've tried all sorts of code with streams, GetBytes, and encodings, and nothing works. Can anyone provide the "right" algorithm to strip a BOM from a string?
I recently had issues with the .NET 4 upgrade, but until then the simple answer is
String.Trim()
removes the BOM up until .NET 3.5.
However, in .NET 4 you need to change it slightly:
String.Trim(new char[]{'\uFEFF'});
That will also get rid of the byte order mark, though you may also want to remove the ZERO WIDTH SPACE (U+200B):
String.Trim(new char[]{'\uFEFF','\u200B'});
This you could also use to remove other unwanted characters.
Some further information is from
String.Trim Method:
The .NET Framework 3.5 SP1 and earlier versions maintain an internal list of white-space characters that this method trims. Starting with the .NET Framework 4, the method trims all Unicode white-space characters (that is, characters that produce a true return value when they are passed to the Char.IsWhiteSpace method). Because of this change, the Trim method in the .NET Framework 3.5 SP1 and earlier versions removes two characters, ZERO WIDTH SPACE (U+200B) and ZERO WIDTH NO-BREAK SPACE (U+FEFF), that the Trim method in the .NET Framework 4 and later versions does not remove. In addition, the Trim method in the .NET Framework 3.5 SP1 and earlier versions does not trim three Unicode white-space characters: MONGOLIAN VOWEL SEPARATOR (U+180E), NARROW NO-BREAK SPACE (U+202F), and MEDIUM MATHEMATICAL SPACE (U+205F).
I had some incorrect test data, which caused me some confusion. Based on How to avoid tripping over UTF-8 BOM when reading files I found that this worked:
private readonly string _byteOrderMarkUtf8 =
    Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());

public string GetXmlResponse(Uri resource)
{
    string xml;

    using (var client = new WebClient())
    {
        client.Encoding = Encoding.UTF8;
        xml = client.DownloadString(resource);
    }

    if (xml.StartsWith(_byteOrderMarkUtf8, StringComparison.Ordinal))
    {
        xml = xml.Remove(0, _byteOrderMarkUtf8.Length);
    }

    return xml;
}
Setting the client Encoding property correctly reduces the BOM to a single character. However, XDocument.Parse still will not read that string. This is the cleanest version I've come up with to date.
This works as well
int index = xmlResponse.IndexOf('<');
if (index > 0)
{
    xmlResponse = xmlResponse.Substring(index, xmlResponse.Length - index);
}
A quick and simple method to remove it directly from a string:
private static string RemoveBom(string p)
{
    string BOMMarkUtf8 = Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());
    if (p.StartsWith(BOMMarkUtf8))
        p = p.Remove(0, BOMMarkUtf8.Length);

    return p.Replace("\0", "");
}
How to use it:
string yourCleanString = RemoveBom(yourBOMString);
If the variable xml is of type string, you did something wrong already - in a character string, the BOM should not be represented as three separate characters, but as a single code point.
Instead of using DownloadString, use DownloadData, and parse byte arrays instead. The XML parser should recognize the BOM itself, and skip it (except for auto-detecting the document encoding as UTF-8).
I had a very similar problem (I needed to parse an XML document represented as a byte array that had a byte order mark at the beginning of it). I used one of Martin's comments on his answer to come to a solution. I took the byte array I had (instead of converting it to a string) and created a MemoryStream object with it. Then I passed it to XDocument.Load, which worked like a charm. For example, let's say that xmlBytes contains your XML in UTF-8 encoding with a byte order mark at the beginning of it. Then, this would be the code to solve the problem:
var stream = new MemoryStream(xmlBytes);
var document = XDocument.Load(stream);
It's that simple.
If starting out with a string, it should still be easy to do (assume xml is your string containing the XML with the byte order mark):
var bytes = Encoding.UTF8.GetBytes(xml);
var stream = new MemoryStream(bytes);
var document = XDocument.Load(stream);
I wrote the following post after coming across this issue.
Essentially, instead of reading in the raw bytes of the file's contents using the BinaryReader class, I use the StreamReader class with a specific constructor, which automatically removes the byte order mark character from the textual data I am trying to retrieve.
It's of course best if you can strip it out while still on the byte array level to avoid unwanted substrings / allocs. But if you already have a string, this is perhaps the easiest and most performant way to handle this.
Usage:
string feed = ""; // input
bool hadBOM = FixBOMIfNeeded(ref feed);
var xElem = XElement.Parse(feed); // now does not fail
/// <summary>
/// You can get this or test it originally with: Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble())[0];
/// But no need, this way we have a constant. As these three bytes `[239, 187, 191]` (a BOM) evaluate to a single C# char.
/// </summary>
public const char BOMChar = (char)65279;
public static bool FixBOMIfNeeded(ref string str)
{
    if (string.IsNullOrEmpty(str))
        return false;

    bool hasBom = str[0] == BOMChar;
    if (hasBom)
        str = str.Substring(1);

    return hasBom;
}
Pass the byte buffer (obtained via DownloadData) to Encoding.UTF8.GetString(byte[]) to get the string, rather than downloading the buffer as a string. You probably have more problems with your current method than just trimming the byte order mark. Unless you're properly decoding it as I suggest here, Unicode characters will probably be misinterpreted, resulting in a corrupted string.
Martin's answer is better, since it avoids allocating an entire string for XML that still needs to be parsed anyway. The answer I gave best applies to general strings that don't need to be parsed as XML.
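For that general-string case, a short sketch of the DownloadData route (resource is the Uri from the earlier example; note that Encoding.UTF8.GetString does not strip the BOM by itself, it simply becomes a single leading U+FEFF character that is easy to trim):

byte[] data;
using (var client = new WebClient())
{
    data = client.DownloadData(resource);
}

// Decode the raw bytes explicitly as UTF-8.
string xml = Encoding.UTF8.GetString(data);

// Any UTF-8 BOM now shows up as one leading U+FEFF character.
xml = xml.TrimStart('\uFEFF');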
I ran into this when I had a Base64-encoded file to transform into a string. While I could have saved it to a file and then read it correctly, here's the best solution I could think of to get from the byte[] of the file to the string (based lightly on TrueWill's answer):
public static string GetUTF8String(byte[] data)
{
    byte[] utf8Preamble = Encoding.UTF8.GetPreamble();
    if (data.StartsWith(utf8Preamble))
    {
        return Encoding.UTF8.GetString(data, utf8Preamble.Length, data.Length - utf8Preamble.Length);
    }
    else
    {
        return Encoding.UTF8.GetString(data);
    }
}
Where StartsWith(byte[]) is the logical extension:
public static bool StartsWith(this byte[] thisArray, byte[] otherArray)
{
    // Handle invalid/unexpected input
    // (nulls, thisArray.Length < otherArray.Length, etc.)
    for (int i = 0; i < otherArray.Length; ++i)
    {
        if (thisArray[i] != otherArray[i])
        {
            return false;
        }
    }
    return true;
}
// The second argument (detectEncodingFromByteOrderMarks: true) lets the
// StreamReader detect and skip a BOM automatically.
StreamReader sr = new StreamReader(strFile, true);
XmlDocument xdoc = new XmlDocument();
xdoc.Load(sr);
Yet another generic variation to get rid of the UTF-8 BOM preamble:
var preamble = Encoding.UTF8.GetPreamble();
if (!functionBytes.Take(preamble.Length).SequenceEqual(preamble))
preamble = Array.Empty<Byte>();
return Encoding.UTF8.GetString(functionBytes, preamble.Length, functionBytes.Length - preamble.Length);
Use a regex replace to filter out any characters other than the alphanumeric characters and spaces that are contained in a normal certificate thumbprint value:
certficateThumbprint = Regex.Replace(certficateThumbprint, @"[^a-zA-Z0-9\-\s*]", "");
And there you go. Voila!! It worked for me.
I solved the issue with the following code
using System.IO;
using System.Xml.Linq;

void method()
{
    byte[] bytes = GetXmlBytes();
    XDocument doc;
    using (var stream = new MemoryStream(bytes))
    {
        doc = XDocument.Load(stream);
    }
}