.Net XmlWriter - unexpected encoding is confusing me - c#

Environment is VS2008, .Net 3.5
The following C# code (note the specified encoding of UTF8)
XmlWriterSettings settings = new XmlWriterSettings ();
StringBuilder sb = new StringBuilder();
settings.Encoding = System.Text.Encoding.UTF8;
settings.Indent = false;
settings.NewLineChars = "\n";
settings.ConformanceLevel = System.Xml.ConformanceLevel.Document;
XmlWriter writer = XmlWriter.Create (sb, settings);
{
// Write XML data.
writer.WriteStartElement ("CCHEADER");
writer.WriteAttributeString ("ProtocolVersion", "1.0.0");
writer.WriteAttributeString ("ServerCapabilities", "0x0000000F");
writer.WriteEndElement ();
writer.Flush ();
}
Actually generates the XML:
<?xml version="1.0" encoding="utf-16"?>
<CCHEADER ProtocolVersion="1.0.0" ServerCapabilities="0x0000000F" />
Why do I get the wrong encoding generated here ? What am I doing wrong ?

I suspect it's because it's writing to a StringBuilder, which is inherently UTF-16. An alternative to get round this is to create a class derived from StringWriter, but which overrides the Encoding property.
I believe I've got one in MiscUtil - but it's pretty trivial to write anyway. Something like this:
public sealed class StringWriterWithEncoding : StringWriter
{
private readonly Encoding encoding;
public StringWriterWithEncoding (Encoding encoding)
{
this.encoding = encoding;
}
public override Encoding Encoding
{
get { return encoding; }
}
}
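As a minimal sketch of how such a writer could be plugged into the code from the question: the writer is passed to XmlWriter.Create in place of the StringBuilder. The HeaderDemo class and BuildHeader method are hypothetical names used only for illustration. When the output is a TextWriter, the declaration takes its encoding name from the writer's Encoding property, so settings.Encoding is not needed here.

```csharp
using System.IO;
using System.Text;
using System.Xml;

public sealed class StringWriterWithEncoding : StringWriter
{
    private readonly Encoding encoding;

    public StringWriterWithEncoding(Encoding encoding)
    {
        this.encoding = encoding;
    }

    // XmlWriter reads this property to decide what to put in the declaration.
    public override Encoding Encoding
    {
        get { return encoding; }
    }
}

// Hypothetical helper, not part of the original answer.
public static class HeaderDemo
{
    public static string BuildHeader()
    {
        XmlWriterSettings settings = new XmlWriterSettings();
        settings.Indent = false;

        using (StringWriterWithEncoding output = new StringWriterWithEncoding(Encoding.UTF8))
        {
            using (XmlWriter writer = XmlWriter.Create(output, settings))
            {
                writer.WriteStartElement("CCHEADER");
                writer.WriteAttributeString("ProtocolVersion", "1.0.0");
                writer.WriteAttributeString("ServerCapabilities", "0x0000000F");
                writer.WriteEndElement();
            }
            // The string itself is still UTF-16 in memory; only the declared name changes.
            return output.ToString();
        }
    }
}
```

The returned string should declare utf-8 rather than utf-16. It still has to be converted to bytes with a UTF-8 encoder before it is written anywhere.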

A .Net String is encoded in Unicode (UTF-16). I expect this is the source of your encoding problems because you're writing to a StringBuilder.

Related

Encoding xml as ISO-8859-1

I sent an xml file which I created while serializing an object and received a response that it is incorrect and not well-formed:
<?xml version="1.0" encoding="utf-8"?>
Moreover, I am supposed to use ISO-8859-1.
I assume that I not only have to change <?xml version="1.0" encoding="ISO-8859-1"?>, but additionally I have to create the file during serialization from the code already with encoding ISO-8859-1. Correct?
I am doing it this way:
XmlSerializer ser = new XmlSerializer(obj.GetType());
var encoding = Encoding.GetEncoding("ISO-8859-1");
XmlWriterSettings xmlWriterSettings = new XmlWriterSettings
{
Indent = true,
OmitXmlDeclaration = false,
Encoding = encoding
};
XmlDocument xd = null;
using (MemoryStream memStm = new MemoryStream())
{
using (var xmlWriter = XmlWriter.Create(memStm, xmlWriterSettings))
{
ser.Serialize(xmlWriter, input);
}
memStm.Position = 0;
XmlReaderSettings settings = new XmlReaderSettings();
using (var xtr = XmlReader.Create(memStm, settings))
{
xd = new XmlDocument();
xd.Load(xtr);
}
}
byte[] file = encoding.GetBytes(xd.OuterXml);
I used a framework to find out what encoding my created files have and when I create them with ISO-8859-1 as above my encoding checker gives me ASCII, is that correct?
I sent an xml file which I created while serializing an object and received a response that it is incorrect and not well-formed:
The ï»¿ characters at the start of the file are a UTF-8 BOM (byte order mark): the bytes EF BB BF look like this when decoded as ISO-8859-1. A BOM may legitimately start a UTF-8 encoded file, so your XML is valid if it is read properly.
More information about BOM: http://www.unicode.org/faq/utf_bom.html#bom1
I assume that I not only have to change <?xml version="1.0" encoding="ISO-8859-1"?>, but additionally I have to create the file during serialization from the code already with encoding ISO-8859-1. Correct?
Correct.
I used a framework to find out what encoding my created files have and when I create them with ISO-8859-1 as above my encoding checker gives me ASCII, is that correct?
The encoding of a text file cannot be determined exactly; it can only be guessed by analysing the bytes. Many encodings, including ISO-8859-1 and UTF-8, encode the ASCII characters identically, so for a file that contains only ASCII characters, "ASCII" is a reasonable result.
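A minimal sketch of the serialization step, assuming the goal is just the encoded bytes: the XmlWriter already writes ISO-8859-1 bytes into the MemoryStream, so they can be taken directly without the round trip through XmlDocument. The Person type and the Latin1Serializer class are hypothetical names used only for illustration.

```csharp
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical type, used only for illustration.
public class Person
{
    public string Name { get; set; }
}

public static class Latin1Serializer
{
    // Serializes value to XML and returns the bytes exactly as the writer encoded them.
    public static byte[] ToBytes<T>(T value, Encoding encoding)
    {
        XmlSerializer serializer = new XmlSerializer(typeof(T));
        XmlWriterSettings settings = new XmlWriterSettings
        {
            Indent = true,
            OmitXmlDeclaration = false,
            Encoding = encoding
        };

        using (MemoryStream stream = new MemoryStream())
        {
            using (XmlWriter writer = XmlWriter.Create(stream, settings))
            {
                serializer.Serialize(writer, value);
            }
            // The writer has been flushed by Dispose; the stream holds the encoded document.
            return stream.ToArray();
        }
    }
}
```

Called as Latin1Serializer.ToBytes(new Person { Name = "Ada" }, Encoding.GetEncoding("ISO-8859-1")), this should yield a document whose declaration names iso-8859-1. Because this particular sample contains only ASCII characters, every byte is below 128, which is why an encoding detector would report it as ASCII.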

How to stop invalid characters in xml with a limited (windows-1251) encoding in c#

This one's puzzling me no end. I need to send an XML message to a Russian web service. The XML has to be encoded in windows-1251.
I have a number of objects that respond to the different types of messages and I turn them into xml thus:
public string Serialise(Type t, object o, XmlSerializerNamespaces Namespaces)
{
XmlSerializer serialiser = _serialisers.First(s => s.GetType().FullName.Contains(t.Name));
Windows1251StringWriter myWriter = new Windows1251StringWriter();
serialiser.Serialize(myWriter, o, Namespaces);
return myWriter.ToString();
}
public class Windows1251StringWriter : StringWriter
{
public override Encoding Encoding
{
get { return Encoding.GetEncoding(1251); }
}
}
which works fine but the web service rejects requests if we send any characters that aren't in windows-1251. In the latest example I tried to send a phone number with 'LEFT-TO-RIGHT EMBEDDING' (U+202A), 'NON-BREAKING HYPHEN' (U+2011) and god help us 'POP DIRECTIONAL FORMATTING' (U+202C). I have no control over the input. I'd like to turn any unknown characters into ? or remove them. I've tried messing with the EncoderFallback but it doesn't seem to change anything.
Am I going about this wrong?
Since you are serializing to a string, the only thing the Encoding property in Windows1251StringWriter does for you is to change the name of the encoding shown in the XML:
<?xml version="1.0" encoding="windows-1251"?>
(I think this trick comes from here.)
And that's it. All c# strings are always encoded in utf-16 and the base class StringWriter writes to this encoding no matter what, regardless of whether the Encoding property is overridden.
To strip away characters from your XML that are invalid in some specific encoding, you need to encode it down to a byte stream, then decode it, e.g. with the following:
public static class XmlSerializationHelper
{
public static string GetXml<X>(this X toSerialize, XmlSerializer serializer = null, XmlSerializerNamespaces namespaces = null, Encoding encoding = null)
{
if (toSerialize == null)
throw new ArgumentNullException();
encoding = encoding ?? Encoding.UTF8;
serializer = serializer ?? new XmlSerializer(toSerialize.GetType());
using (var stream = new MemoryStream())
using (var writer = new StreamWriter(stream, encoding))
{
serializer.Serialize(writer, toSerialize, namespaces);
writer.Flush();
stream.Position = 0;
using (var reader = new StreamReader(stream, encoding))
{
return reader.ReadToEnd();
}
}
}
}
Then do
var encoding = Encoding.GetEncoding(1251, new EncoderReplacementFallback(""), new DecoderExceptionFallback());
return o.GetXml(serialiser, Namespaces, encoding);
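The encode-then-decode step can also be sketched on its own, separate from the serializer. The EncodingSanitizer class and StripUnencodable method below are hypothetical names. The key assumption is that the fallback is supplied when the Encoding is created through GetEncoding, as in the snippet above. On .NET Core and later, code page 1251 is, as far as I know, only available after registering CodePagesEncodingProvider.Instance, so the usage note uses us-ascii to stay runnable everywhere.

```csharp
using System.Text;

public static class EncodingSanitizer
{
    // Round-trips text through the named encoding. Characters the encoding
    // cannot represent are replaced by `replacement` (use "" to drop them).
    public static string StripUnencodable(string text, string encodingName, string replacement)
    {
        Encoding lossy = Encoding.GetEncoding(
            encodingName,
            new EncoderReplacementFallback(replacement),
            new DecoderExceptionFallback());

        byte[] bytes = lossy.GetBytes(text);
        return lossy.GetString(bytes);
    }
}
```

For the phone number from the question, StripUnencodable("+7\u202A495\u2011123\u202C", "us-ascii", "?") returns "+7?495?123?", and passing "" as the replacement returns "+7495123".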

How to replace encoding=utf-8 in the xml header with an empty string using XmlTextWriter?

I am using XmlTextWriter in C# to generate XML files from CSV files, and the resulting XML has the header <?xml version="1.0" encoding="utf-8"?>. I need to remove the encoding="utf-8" part so that the header becomes <?xml version="1.0"?>. I have searched a lot and have not found anything so far. The code that generates the header is as follows:
var writer = new XmlTextWriter(s, Encoding.UTF8) {
Formatting = Formatting.Indented
};
writer.WriteStartDocument();
According to the docs it is possible:
public XmlTextWriter(
Stream w,
Encoding encoding
)
encoding
Type: System.Text.Encoding
The encoding to generate. If encoding is null it writes out the stream as UTF-8 and omits the encoding attribute from the ProcessingInstruction.
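A minimal sketch of what that documentation describes, assuming a MemoryStream as the target: passing null as the encoding writes UTF-8 and leaves the encoding attribute out of the declaration. The NoEncodingAttribute class and its Write method are hypothetical names used only for illustration.

```csharp
using System.IO;
using System.Text;
using System.Xml;

public static class NoEncodingAttribute
{
    public static string Write()
    {
        using (MemoryStream stream = new MemoryStream())
        {
            // A null encoding means: write UTF-8, omit the encoding attribute.
            XmlTextWriter writer = new XmlTextWriter(stream, null);
            writer.WriteStartDocument();
            writer.WriteElementString("root", "value");
            writer.WriteEndDocument();
            writer.Flush();

            // Decode only to inspect the result; the stream already holds UTF-8 bytes.
            return Encoding.UTF8.GetString(stream.ToArray());
        }
    }
}
```

The returned text should begin with <?xml version="1.0"?> and contain no encoding attribute.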
You can create a subclass of XmlTextWriter:
public class XmlOmitEncodingWriter : XmlTextWriter
{
public XmlOmitEncodingWriter(Stream w, Encoding encoding) : base(w, encoding)
{}
public XmlOmitEncodingWriter(string filename, Encoding encoding) : base(filename, encoding)
{}
public XmlOmitEncodingWriter(TextWriter w) : base(w)
{}
public override void WriteStartDocument()
{
WriteRaw("<?xml version=\"1.0\"?>");
}
}
Use it as such:
var writer = new XmlOmitEncodingWriter(s, Encoding.UTF8) {
Formatting = Formatting.Indented
};
writer.WriteStartDocument();
This will output <?xml version="1.0"?>. It will also support writing the document in any encoding, not tying you to UTF-8.

Serializing an object as UTF-8 XML in .NET

Proper object disposal removed for brevity but I'm shocked if this is the simplest way to encode an object as UTF-8 in memory. There has to be an easier way doesn't there?
var serializer = new XmlSerializer(typeof(SomeSerializableObject));
var memoryStream = new MemoryStream();
var streamWriter = new StreamWriter(memoryStream, System.Text.Encoding.UTF8);
serializer.Serialize(streamWriter, entry);
memoryStream.Seek(0, SeekOrigin.Begin);
var streamReader = new StreamReader(memoryStream, System.Text.Encoding.UTF8);
var utf8EncodedXml = streamReader.ReadToEnd();
No, you can use a StringWriter to get rid of the intermediate MemoryStream. However, to make the XML declaration say UTF-8 you need to use a StringWriter which overrides the Encoding property:
public class Utf8StringWriter : StringWriter
{
public override Encoding Encoding => Encoding.UTF8;
}
Or if you're not using C# 6 yet:
public class Utf8StringWriter : StringWriter
{
public override Encoding Encoding { get { return Encoding.UTF8; } }
}
Then:
var serializer = new XmlSerializer(typeof(SomeSerializableObject));
string utf8;
using (StringWriter writer = new Utf8StringWriter())
{
serializer.Serialize(writer, entry);
utf8 = writer.ToString();
}
Obviously you can make Utf8StringWriter into a more general class which accepts any encoding in its constructor - but in my experience UTF-8 is by far the most commonly required "custom" encoding for a StringWriter :)
Now as Jon Hanna says, this will still be UTF-16 internally, but presumably you're going to pass it to something else at some point, to convert it into binary data... at that point you can use the above string, convert it into UTF-8 bytes, and all will be well - because the XML declaration will specify "utf-8" as the encoding.
EDIT: A short but complete example to show this working:
using System;
using System.Text;
using System.IO;
using System.Xml.Serialization;
public class Test
{
public int X { get; set; }
static void Main()
{
Test t = new Test();
var serializer = new XmlSerializer(typeof(Test));
string utf8;
using (StringWriter writer = new Utf8StringWriter())
{
serializer.Serialize(writer, t);
utf8 = writer.ToString();
}
Console.WriteLine(utf8);
}
public class Utf8StringWriter : StringWriter
{
public override Encoding Encoding => Encoding.UTF8;
}
}
Result:
<?xml version="1.0" encoding="utf-8"?>
<Test xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<X>0</X>
</Test>
Note the declared encoding of "utf-8" which is what we wanted, I believe.
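The last step, turning the string into actual UTF-8 bytes, could look like the following minimal sketch. The Sample type and the Utf8Bytes class are hypothetical names used only for illustration. Encoding.UTF8.GetBytes does not emit a BOM, so the first byte is the opening angle bracket.

```csharp
using System.IO;
using System.Text;
using System.Xml.Serialization;

public class Utf8StringWriter : StringWriter
{
    public override Encoding Encoding { get { return Encoding.UTF8; } }
}

// Hypothetical type, used only for illustration.
public class Sample
{
    public int X { get; set; }
}

public static class Utf8Bytes
{
    public static byte[] Serialize<T>(T value)
    {
        XmlSerializer serializer = new XmlSerializer(typeof(T));
        using (StringWriter writer = new Utf8StringWriter())
        {
            serializer.Serialize(writer, value);
            // The string is UTF-16 in memory; this call produces the UTF-8 octets
            // that match the "utf-8" name in the declaration.
            return Encoding.UTF8.GetBytes(writer.ToString());
        }
    }
}
```

Utf8Bytes.Serialize(new Sample()) returns bytes that can be written to a file or socket as they are, and the declaration inside them agrees with the encoding actually used.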
Your code doesn't get the UTF-8 into memory, because you read it back into a string again, so it's no longer in UTF-8 but back in UTF-16 (though ideally it's best to consider strings at a higher level than any encoding, except when forced to do so).
To get the actual UTF-8 octets you could use:
var serializer = new XmlSerializer(typeof(SomeSerializableObject));
var memoryStream = new MemoryStream();
var streamWriter = new StreamWriter(memoryStream, System.Text.Encoding.UTF8);
serializer.Serialize(streamWriter, entry);
byte[] utf8EncodedXml = memoryStream.ToArray();
I've left out the same disposal you've left. I slightly favour the following (with normal disposal left in):
var serializer = new XmlSerializer(typeof(SomeSerializableObject));
using(var memStm = new MemoryStream())
using(var xw = XmlWriter.Create(memStm))
{
serializer.Serialize(xw, entry);
var utf8 = memStm.ToArray();
}
Which is much the same amount of complexity, but does show that at every stage there is a reasonable choice to do something else, the most pressing of which is to serialise to somewhere other than to memory, such as to a file, TCP/IP stream, database, etc. All in all, it's not really that verbose.
Very good answer using inheritance. Just remember to add a constructor that forwards to the base class if you want to write into an existing StringBuilder:
public class Utf8StringWriter : StringWriter
{
public Utf8StringWriter(StringBuilder sb) : base (sb)
{
}
public override Encoding Encoding { get { return Encoding.UTF8; } }
}
I found this blog post which explains the problem very well, and defines a few different solutions:
(dead link removed)
I've settled for the idea that the best way to do it is to completely omit the XML declaration when in memory. It actually is UTF-16 at that point anyway, but the XML declaration doesn't seem meaningful until it has been written to a file with a particular encoding; and even then the declaration is not required. It doesn't seem to break deserialization, at least.
As @Jon Hanna mentions, this can be done with an XmlWriter created like this:
XmlWriter writer = XmlWriter.Create (output, new XmlWriterSettings() { OmitXmlDeclaration = true });
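A minimal sketch of that approach combined with XmlSerializer, assuming the XML is only needed as an in-memory string. The Entry type and the DeclarationFreeXml class are hypothetical names used only for illustration.

```csharp
using System.Text;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical type, used only for illustration.
public class Entry
{
    public int X { get; set; }
}

public static class DeclarationFreeXml
{
    public static string Serialize<T>(T value)
    {
        XmlSerializer serializer = new XmlSerializer(typeof(T));
        StringBuilder output = new StringBuilder();
        XmlWriterSettings settings = new XmlWriterSettings { OmitXmlDeclaration = true };

        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            serializer.Serialize(writer, value);
        }
        // No <?xml ...?> line, so there is no encoding name that could be wrong.
        return output.ToString();
    }
}
```

The result starts directly with the root element. Whoever writes the string to a file or stream later can choose the encoding at that point.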

How to put an encoding attribute to xml other than utf-16 with XmlWriter?

I've got a function creating some XmlDocument:
public string CreateOutputXmlString(ICollection<Field> fields)
{
XmlWriterSettings settings = new XmlWriterSettings();
settings.Indent = true;
settings.Encoding = Encoding.GetEncoding("windows-1250");
StringBuilder builder = new StringBuilder();
XmlWriter writer = XmlWriter.Create(builder, settings);
writer.WriteStartDocument();
writer.WriteStartElement("data");
foreach (Field field in fields)
{
writer.WriteStartElement("item");
writer.WriteAttributeString("name", field.Id);
writer.WriteAttributeString("value", field.Value);
writer.WriteEndElement();
}
writer.WriteEndElement();
writer.Flush();
writer.Close();
return builder.ToString();
}
I set an encoding, but after I create the XmlWriter it still has utf-16 encoding. I know it's because strings (and StringBuilder, I suppose) are encoded in utf-16 and you can't change it.
So how can I easily create this xml with the encoding attribute set to "windows-1250"? it doesn't even have to be encoded in this encoding, it just has to have the specified attribute.
edit: it has to be in .Net 2.0 so any new framework elements cannot be used.
You need to use a StringWriter with the appropriate encoding. Unfortunately StringWriter doesn't let you specify the encoding directly, so you need a class like this:
public sealed class StringWriterWithEncoding : StringWriter
{
private readonly Encoding encoding;
public StringWriterWithEncoding (Encoding encoding)
{
this.encoding = encoding;
}
public override Encoding Encoding
{
get { return encoding; }
}
}
(This question is similar but not quite a duplicate.)
EDIT: To answer the comment: pass the StringWriterWithEncoding to XmlWriter.Create instead of the StringBuilder, then call ToString() on it at the end.
Just some extra explanations to why this is so.
Strings are sequences of characters, not bytes. Strings, per se, are not "encoded", because they are using characters, which are stored as Unicode codepoints. Encoding DOES NOT MAKE SENSE at String level.
An encoding is a mapping from a sequence of codepoints (characters) to a sequence of bytes (for storage on byte-based systems like filesystems or memory). The framework does not let you specify encodings, unless there is a compelling reason to, like to make 16-bit codepoints fit on byte-based storage.
So when you're trying to write your XML into a StringBuilder, you're actually building an XML sequence of characters and writing them as a sequence of characters, so no encoding is performed. Therefore, no Encoding field.
If you want to use an encoding, the XmlWriter has to write to a Stream.
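To make the distinction between characters and bytes concrete, here is a small sketch that encodes the same string with different encodings and counts the bytes. The EncodingSizes class and ByteCount method are hypothetical names used only for illustration.

```csharp
using System.Text;

public static class EncodingSizes
{
    // The string is the same sequence of characters in every call;
    // only the mapping to bytes differs.
    public static int ByteCount(string text, Encoding encoding)
    {
        return encoding.GetBytes(text).Length;
    }
}
```

For the four-character string "caf\u00E9", ByteCount gives 5 with Encoding.UTF8, 8 with Encoding.Unicode (UTF-16) and 4 with Encoding.GetEncoding("ISO-8859-1"). The encoding only comes into play at the moment the characters are turned into bytes.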
About the solution that you found with the MemoryStream, no offense intended, but it's just flapping around arms and moving hot air. You're encoding your codepoints with 'windows-1250', and then parsing it back to codepoints. The only change that may occur is that characters not defined in windows-1250 get converted to a '?' character in the process.
To me, the right solution might be the following one. Depending on what your function is used for, you could pass a Stream as a parameter to your function, so that the caller decides whether it should be written to memory or to a file. So it would be written like this:
public static void WriteFieldsAsXmlDocument(ICollection<Field> fields, Stream outStream)
{
XmlWriterSettings settings = new XmlWriterSettings();
settings.Indent = true;
settings.Encoding = Encoding.GetEncoding("windows-1250");
using(XmlWriter writer = XmlWriter.Create(outStream, settings)) {
writer.WriteStartDocument();
writer.WriteStartElement("data");
foreach (Field field in fields)
{
writer.WriteStartElement("item");
writer.WriteAttributeString("name", field.Id);
writer.WriteAttributeString("value", field.Value);
writer.WriteEndElement();
}
writer.WriteEndElement();
}
}
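A self-contained sketch of how a caller might use such a stream-based method. The Field class here is a hypothetical stand-in for the type from the question. The encoding is made a parameter, which is my own addition, so that the sample can run with ISO-8859-1 on platforms where windows-1250 needs an extra encoding provider.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical stand-in for the Field type used in the question.
public class Field
{
    public string Id { get; set; }
    public string Value { get; set; }
}

public static class FieldXml
{
    // Writes the document to whatever stream the caller supplies.
    public static void Write(ICollection<Field> fields, Stream outStream, Encoding encoding)
    {
        XmlWriterSettings settings = new XmlWriterSettings();
        settings.Indent = true;
        settings.Encoding = encoding;

        using (XmlWriter writer = XmlWriter.Create(outStream, settings))
        {
            writer.WriteStartDocument();
            writer.WriteStartElement("data");
            foreach (Field field in fields)
            {
                writer.WriteStartElement("item");
                writer.WriteAttributeString("name", field.Id);
                writer.WriteAttributeString("value", field.Value);
                writer.WriteEndElement();
            }
            writer.WriteEndElement();
        }
    }

    // One possible caller: keep the encoded bytes in memory.
    public static byte[] ToBytes(ICollection<Field> fields, Encoding encoding)
    {
        using (MemoryStream memory = new MemoryStream())
        {
            Write(fields, memory, encoding);
            return memory.ToArray();
        }
    }
}
```

A caller that wants a file instead can pass the stream returned by File.Create to Write. In both cases the bytes and the declared encoding name come from the same Encoding object, so they cannot disagree.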
MemoryStream memoryStream = new MemoryStream();
XmlWriterSettings xmlWriterSettings = new XmlWriterSettings();
xmlWriterSettings.Encoding = Encoding.UTF8;
XmlWriter xmlWriter = XmlWriter.Create(memoryStream, xmlWriterSettings);
xmlWriter.WriteStartDocument();
xmlWriter.WriteStartElement("root", "http://www.timvw.be/ns");
xmlWriter.WriteEndElement();
xmlWriter.WriteEndDocument();
xmlWriter.Flush();
xmlWriter.Close();
string xmlString = Encoding.UTF8.GetString(memoryStream.ToArray());
From here
I actually solved the problem with MemoryStream:
public static string CreateOutputXmlString(ICollection<Field> fields)
{
XmlWriterSettings settings = new XmlWriterSettings();
settings.Indent = true;
settings.Encoding = Encoding.GetEncoding("windows-1250");
MemoryStream memStream = new MemoryStream();
XmlWriter writer = XmlWriter.Create(memStream, settings);
writer.WriteStartDocument();
writer.WriteStartElement("data");
foreach (Field field in fields)
{
writer.WriteStartElement("item");
writer.WriteAttributeString("name", field.Id);
writer.WriteAttributeString("value", field.Value);
writer.WriteEndElement();
}
writer.WriteEndElement();
writer.Flush();
writer.Close();
string xml = Encoding.GetEncoding("windows-1250").GetString(memStream.ToArray());
memStream.Close();
return xml;
}
I solved mine by outputting the string to a variable then replacing any references to utf-16 with utf-8 (my app needed UTF8 encoding). Since you're using a function, you could do something similar. I use VB.net mostly, but I think the C# would look something like this.
return builder.ToString().Replace("utf-16", "utf-8");
