Hyphen in XML causes XmlWriter to fail with UTF-8 error - c#

I have XML with the element: <DESCRIPTION>fault – No reply</DESCRIPTION>
I am transforming the Xml from a web-service as follows based on Jon Skeet's code https://stackoverflow.com/a/427737/197229 (the original Xml validates fine):
public sealed class StringWriterUTF8 : StringWriter
{
public override Encoding Encoding
{
get { return Encoding.UTF8; }
}
}
WebRequest request = WebRequest.Create(url);
WebResponse response = request.GetResponse();
Stream stream = response.GetResponseStream();
StreamReader streamReader = new StreamReader(stream);
string xml = streamReader.ReadToEnd();
logger.Log().Debug(String.Format("Received Xml:\n{0}", xml));
if(Transform != null)
{
using (var stringReader = new StringReader(xml))
using (var xmlReader = XmlReader.Create(stringReader))
using (var stringWriter = new StringWriterUTF8())
using (var xmlTextWriter = XmlWriter.Create(stringWriter, new XmlWriterSettings()
{ Indent= true}))
{
Transform.Transform(xmlReader,xmlTextWriter);
xml = stringWriter.ToString();
logger.Log().Debug(String.Format("Transformed Xml:\n{0}", xml));
}
}
Everything looks great... but the generated XML is failing validation when I try to use it, even though to the naked eye it looks fine. If I remove that hyphen, there are no problems.
I don't understand why the original XML is fine and the .Net classes are getting tripped up, but if I try and validate the Xml in Notepad++ I get this:
Input is not proper UTF-8, indicate encoding ! Bytes: 0x96 0x20 0x4E
0x6F
How can I resolve this? All I want to do is receive Xml and transform it to a new Xml file without encoding weirdness!
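One possible reading of the Notepad++ message, offered as a hedged sketch rather than a confirmed diagnosis: 0x96 is the single-byte Windows-1252 code for the en dash (U+2013), whereas UTF-8 encodes that character as three bytes. That would suggest the transformed string was later saved with an ANSI code page while its declaration says utf-8. The class name `EnDashBytes` is illustrative only.

```csharp
using System;
using System.Text;

public static class EnDashBytes
{
    // Returns the UTF-8 bytes of a string as dash-separated uppercase hex.
    public static string Utf8Hex(string text)
    {
        byte[] bytes = new UTF8Encoding(false).GetBytes(text);
        return BitConverter.ToString(bytes);
    }
}
```

`EnDashBytes.Utf8Hex("\u2013")` yields `E2-80-93`, not `96`. If the saved file really contains a lone 0x96, writing it with an explicit encoding, for example `File.WriteAllText(path, xml, new UTF8Encoding(false))`, should keep the bytes consistent with the declaration.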

Related

Converting xml to stream and disabling dtd breaking Deserialization

I have created the following wrapper method to disable DTD
public class Program
{
public static void Main(string[] args)
{
string s = @"<?xml version =""1.0"" encoding=""utf-16""?>
<ArrayOfSerializingTemplateItem xmlns:xsd=""http://www.w3.org/2001/XMLSchema"" xmlns:xsi=""http://www.w3.org/2001/XMLSchema-instance"">
<SerializingTemplateItem>
</SerializingTemplateItem>
</ArrayOfSerializingTemplateItem >";
try
{
XmlReader reader = XmlWrapper.CreateXmlReaderObject(s);
XmlSerializer sr = new XmlSerializer(typeof(List<SerializingTemplateItem>));
Object ob = sr.Deserialize(reader);
}
catch (Exception ex)
{
Console.WriteLine(ex);
throw;
}
Console.ReadLine();
}
}
public class XmlWrapper
{
public static XmlReader CreateXmlReaderObject(string sr)
{
byte[] byteArray = Encoding.UTF8.GetBytes(sr);
MemoryStream stream = new MemoryStream(byteArray);
stream.Position = 0;
XmlReaderSettings settings = new XmlReaderSettings();
settings.ValidationType = ValidationType.None;
settings.DtdProcessing = DtdProcessing.Ignore;
return XmlReader.Create(stream, settings);
}
}
public class SerializingTemplateItem
{
}
The above throws exception "There is no Unicode byte order mark. Cannot switch to Unicode." (Demo fiddle here: https://dotnetfiddle.net/pGxOE9).
But if I use the following code to create the XmlReader instead of calling the XmlWrapper method, it works fine:
StringReader stringReader = new StringReader( xml );
XmlReader reader = new XmlTextReader( stringReader );
But I need to use the wrapper method as a security requirement to disable DTD. I don't know why I am unable to deserialize after calling my wrapper method. Any help will be highly appreciated.
Your problem is that you have encoded the XML into a MemoryStream using Encoding.UTF8, but the XML string itself claims to be encoded in UTF-16 in the encoding declaration in its XML text declaration:
<?xml version ="1.0" encoding="utf-16"?>
<ArrayOfSerializingTemplateItem>
<!-- Content omitted -->
</ArrayOfSerializingTemplateItem >
Apparently when the XmlReader encounters this declaration, it tries to honor the declaration and switch from UTF-8 to UTF-16 but fails for some reason, possibly because the stream really is encoded in UTF-8. Conversely, when the deprecated XmlTextReader encounters the declaration, it apparently just ignores it as not implemented, which happens to cause things to work successfully in this situation.
The simplest way to resolve this is to read directly from the string using a StringReader using XmlReader.Create(TextReader, XmlReaderSettings):
public class XmlWrapper
{
public static XmlReader CreateXmlReaderObject(string sr)
{
var settings = new XmlReaderSettings
{
ValidationType = ValidationType.None,
DtdProcessing = DtdProcessing.Ignore,
};
return XmlReader.Create(new StringReader(sr), settings);
}
}
Since a C# string is always encoded internally in UTF-16, the encoding declaration in the XML will be ignored as irrelevant. This will also be more performant, as the conversion to an intermediate byte array is skipped completely.
Incidentally, you should dispose of your XmlReader via a using statement:
Object ob;
using (var reader = XmlWrapper.CreateXmlReaderObject(s))
{
XmlSerializer sr = new XmlSerializer(typeof(List<SerializingTemplateItem>));
ob = sr.Deserialize(reader);
}
Working sample fiddle here.
Related questions:
Meaning of - <?xml version="1.0" encoding="utf-8"?>
Ignoring specified encoding when deserializing XML
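If a byte stream is genuinely required, another option is to make the bytes agree with the declaration instead of bypassing it. The following is a minimal sketch under that assumption: the helper names `CreateFromUtf16Bytes` and `RootName` are hypothetical, and the string is encoded as UTF-16 with a byte order mark so that it matches encoding="utf-16".

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class Utf16StreamReaderDemo
{
    // Hypothetical helper: encodes the string as UTF-16 (with BOM) so the bytes
    // agree with an encoding="utf-16" declaration, then opens a DTD-ignoring reader.
    public static XmlReader CreateFromUtf16Bytes(string xml)
    {
        byte[] bom = Encoding.Unicode.GetPreamble();   // FF FE
        byte[] body = Encoding.Unicode.GetBytes(xml);  // UTF-16 little-endian
        var stream = new MemoryStream();
        stream.Write(bom, 0, bom.Length);
        stream.Write(body, 0, body.Length);
        stream.Position = 0;

        var settings = new XmlReaderSettings
        {
            ValidationType = ValidationType.None,
            DtdProcessing = DtdProcessing.Ignore,
            CloseInput = true, // disposing the reader also disposes the stream
        };
        return XmlReader.Create(stream, settings);
    }

    // Reads just far enough to report the root element name.
    public static string RootName(string xml)
    {
        using (var reader = CreateFromUtf16Bytes(xml))
        {
            reader.MoveToContent(); // skips the declaration and whitespace
            return reader.LocalName;
        }
    }
}
```

The StringReader approach above remains simpler; this variant only matters when some other API insists on a Stream.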

How to stop invalid characters in xml with a limited (windows-1251) encoding in c#

This one's puzzling me no end. I need to send an XML message to a Russian web service. The XML has to be encoded in windows-1251.
I have a number of objects that respond to the different types of messages and I turn them into xml thus:
public string Serialise(Type t, object o, XmlSerializerNamespaces Namespaces)
{
XmlSerializer serialiser = _serialisers.First(s => s.GetType().FullName.Contains(t.Name));
Windows1251StringWriter myWriter = new Windows1251StringWriter();
serialiser.Serialize(myWriter, o, Namespaces);
return myWriter.ToString();
}
public class Windows1251StringWriter : StringWriter
{
public override Encoding Encoding
{
get { return Encoding.GetEncoding(1251); }
}
}
which works fine but the web service rejects requests if we send any characters that aren't in windows-1251. In the latest example I tried to send a phone number with 'LEFT-TO-RIGHT EMBEDDING' (U+202A), 'NON-BREAKING HYPHEN' (U+2011) and god help us 'POP DIRECTIONAL FORMATTING' (U+202C). I have no control over the input. I'd like to turn any unknown characters into ? or remove them. I've tried messing with the EncoderFallback but it doesn't seem to change anything.
Am I going about this wrong?
Since you are serializing to a string, the only thing the Encoding property in Windows1251StringWriter does for you is to change the name of the encoding shown in the XML:
<?xml version="1.0" encoding="windows-1251"?>
(I think this trick comes from here.)
And that's it. All C# strings are always encoded in UTF-16, and the base class StringWriter writes UTF-16 no matter what, regardless of whether the Encoding property is overridden.
To strip away characters from your XML that are invalid in some specific encoding, you need to encode it down to a byte stream, then decode it, e.g. with the following:
public static class XmlSerializationHelper
{
public static string GetXml<X>(this X toSerialize, XmlSerializer serializer = null, XmlSerializerNamespaces namespaces = null, Encoding encoding = null)
{
if (toSerialize == null)
throw new ArgumentNullException();
encoding = encoding ?? Encoding.UTF8;
serializer = serializer ?? new XmlSerializer(toSerialize.GetType());
using (var stream = new MemoryStream())
using (var writer = new StreamWriter(stream, encoding))
{
serializer.Serialize(writer, toSerialize, namespaces);
writer.Flush();
stream.Position = 0;
using (var reader = new StreamReader(stream, encoding))
{
return reader.ReadToEnd();
}
}
}
}
Then do
var encoding = Encoding.GetEncoding(1251, new EncoderReplacementFallback(""), new DecoderExceptionFallback());
return o.GetXml(serialiser, Namespaces, encoding);
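The effect of the replacement fallback can be seen in isolation with a plain encode/decode round trip. This is a sketch only: us-ascii stands in for windows-1251 so that it runs without any code page setup, and the names `FallbackDemo`, `Sanitize` and `AsciiWithReplacement` are made up for illustration.

```csharp
using System;
using System.Text;

public static class FallbackDemo
{
    // Round-trips text through a limited encoding so that characters the
    // encoding cannot represent are substituted by the encoder fallback.
    public static string Sanitize(string text, Encoding target)
    {
        return target.GetString(target.GetBytes(text));
    }

    // Assumption: us-ascii stands in for windows-1251 so the sketch runs
    // without registering a code page provider.
    public static Encoding AsciiWithReplacement(string replacement)
    {
        return Encoding.GetEncoding(
            "us-ascii",
            new EncoderReplacementFallback(replacement),
            new DecoderExceptionFallback());
    }
}
```

With the phone-number characters from the question, `Sanitize("\u202A555\u20111234\u202C", AsciiWithReplacement("?"))` gives `?555?1234?`, and passing `""` as the replacement gives `5551234`. On .NET Core and later, `Encoding.GetEncoding(1251)` additionally needs `Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)` to have been called first.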

Why is Windows Azure storage blob XML truncated?

I'm trying to save an xml file into a blob. I have no error, everything seems fine except when I navigate to the blob url I see a blank page. If I look at the source code of the web page I can see my xml but truncated.
Here is the code.
StringBuilder fileString = new StringBuilder();
XmlWriterSettings xmlSettings=new XmlWriterSettings
{
Encoding = new UTF8Encoding(false)
};
using (XmlWriter writer = XmlWriter.Create(fileString, xmlSettings))
{
// ... write the XML content here ...
}
CloudBlockBlob fileBlob = container.GetBlockBlobReference("site.xml");
fileBlob.UploadText(fileString.ToString());
I found the solution in another post (though not a full explanation of the problem, which has to do with the text always being encoded as UTF-16 despite the writer being set up as UTF-8). I am now using a Stream and it works fine.
MemoryStream fileString = new MemoryStream();
XmlWriterSettings xmlSettings=new XmlWriterSettings
{
Encoding = Encoding.UTF8,
Indent = true
};
using (XmlWriter writer = XmlWriter.Create(fileString, xmlSettings))
{
// ... write the XML content here ...
}
CloudBlockBlob fileBlob = container.GetBlockBlobReference("site.xml");
fileBlob.UploadText(StreamToString(fileString));
private static string StreamToString(Stream stream)
{
stream.Position = 0;
var reader = new StreamReader(stream);
return reader.ReadToEnd();
}
If you have decided to use a Stream, you can also upload it using UploadFromStream instead of UploadText, which encodes the string into a sequence of bytes, creates a memory stream, and calls UploadFromStream anyway.
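The encoding behaviour described in this thread can be reproduced without any storage SDK. The sketch below (class and method names are illustrative) writes the same element once into a StringBuilder and once into a MemoryStream, using the same XmlWriterSettings.Encoding in both cases.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class XmlEncodingDemo
{
    // Target is a StringBuilder: the output is a .NET string, so the
    // declaration reports utf-16 and settings.Encoding has no effect.
    public static string WriteToStringBuilder()
    {
        var sb = new StringBuilder();
        var settings = new XmlWriterSettings { Encoding = new UTF8Encoding(false) };
        using (var writer = XmlWriter.Create(sb, settings))
        {
            writer.WriteElementString("site", "example");
        }
        return sb.ToString();
    }

    // Target is a Stream: settings.Encoding is honored and real UTF-8 bytes
    // (without a BOM here) are produced.
    public static byte[] WriteToStream()
    {
        using (var ms = new MemoryStream())
        {
            var settings = new XmlWriterSettings { Encoding = new UTF8Encoding(false) };
            using (var writer = XmlWriter.Create(ms, settings))
            {
                writer.WriteElementString("site", "example");
            }
            return ms.ToArray();
        }
    }
}
```

The first method returns text declaring encoding="utf-16"; the second returns bytes whose declaration says utf-8. Uploading a utf-16-declared string as UTF-8 text gives a document whose declaration contradicts its bytes, which is one plausible reason a browser would refuse to render it.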

XDocument.Load() Error

I have some code:
WebRequest request = HttpWebRequest.Create(url);
WebResponse response = request.GetResponse();
using (System.IO.StreamReader sr =
new System.IO.StreamReader(response.GetResponseStream()))
{
System.Xml.Linq.XDocument doc = new System.Xml.Linq.XDocument();
doc.Load(new System.IO.StringReader(sr.ReadToEnd()));
}
I can't load my response in my XML document. I get the following error:
Member 'System.Xml.Linq.XDocument.Load(System.IO.TextReader)' cannot be accessed
with an instance reference; qualify it with a type name instead.
This is becoming really frustrating. What am I doing wrong?
Unlike XmlDocument.Load, XDocument.Load is a static method returning a new XDocument:
XDocument doc = XDocument.Load(new StringReader(sr.ReadToEnd()));
It seems pretty pointless to read the stream to the end then create a StringReader though. It's also pointless creating the StreamReader in the first place - and if the XML document isn't in UTF-8, it could cause problems. Better:
For .NET 4, where there's an XDocument.Load(Stream) overload:
using (var response = request.GetResponse())
{
using (var stream = response.GetResponseStream())
{
var doc = XDocument.Load(stream);
}
}
For .NET 3.5, where there isn't:
using (var response = request.GetResponse())
{
using (var stream = response.GetResponseStream())
{
var doc = XDocument.Load(XmlReader.Create(stream));
}
}
Or alternatively, just let LINQ to XML do all the work:
XDocument doc = XDocument.Load(url);
EDIT: Note that the compiler error did give you enough information to get you going: it told you that you can't call XDocument.Load as doc.Load, and to give the type name instead. Your next step should have been to consult the documentation, which of course gives examples.
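To illustrate why the intermediate StreamReader can cause problems when the document is not UTF-8, here is a small self-contained sketch. The sample document, its ISO-8859-1 encoding and all names in it are assumptions chosen for the demonstration.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Linq;

public static class LoadFromStreamDemo
{
    // Sample document declared and encoded as ISO-8859-1 (illustrative data).
    public static byte[] Latin1Document()
    {
        string xml = "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?><r>caf\u00E9</r>";
        return Encoding.GetEncoding("iso-8859-1").GetBytes(xml);
    }

    // The XML parser reads the declaration and decodes the bytes itself.
    public static string ViaStream(byte[] bytes)
    {
        using (var stream = new MemoryStream(bytes))
        {
            return XDocument.Load(stream).Root.Value;
        }
    }

    // A default StreamReader assumes UTF-8, so the 0xE9 byte is not decoded as 'é'.
    public static string ViaStreamReader(byte[] bytes)
    {
        using (var stream = new MemoryStream(bytes))
        using (var reader = new StreamReader(stream))
        {
            return XDocument.Parse(reader.ReadToEnd()).Root.Value;
        }
    }
}
```

`ViaStream` returns the text intact, while `ViaStreamReader` returns a value with the accented character damaged, because the bytes were decoded before the parser ever saw the declaration.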

Serializing an object as UTF-8 XML in .NET

Proper object disposal removed for brevity, but I'm shocked if this is the simplest way to encode an object as UTF-8 in memory. There has to be an easier way, doesn't there?
var serializer = new XmlSerializer(typeof(SomeSerializableObject));
var memoryStream = new MemoryStream();
var streamWriter = new StreamWriter(memoryStream, System.Text.Encoding.UTF8);
serializer.Serialize(streamWriter, entry);
memoryStream.Seek(0, SeekOrigin.Begin);
var streamReader = new StreamReader(memoryStream, System.Text.Encoding.UTF8);
var utf8EncodedXml = streamReader.ReadToEnd();
No, you can use a StringWriter to get rid of the intermediate MemoryStream. However, to make the XML declaration specify UTF-8 you need to use a StringWriter which overrides the Encoding property:
public class Utf8StringWriter : StringWriter
{
public override Encoding Encoding => Encoding.UTF8;
}
Or if you're not using C# 6 yet:
public class Utf8StringWriter : StringWriter
{
public override Encoding Encoding { get { return Encoding.UTF8; } }
}
Then:
var serializer = new XmlSerializer(typeof(SomeSerializableObject));
string utf8;
using (StringWriter writer = new Utf8StringWriter())
{
serializer.Serialize(writer, entry);
utf8 = writer.ToString();
}
Obviously you can make Utf8StringWriter into a more general class which accepts any encoding in its constructor - but in my experience UTF-8 is by far the most commonly required "custom" encoding for a StringWriter :)
Now as Jon Hanna says, this will still be UTF-16 internally, but presumably you're going to pass it to something else at some point, to convert it into binary data... at that point you can use the above string, convert it into UTF-8 bytes, and all will be well - because the XML declaration will specify "utf-8" as the encoding.
EDIT: A short but complete example to show this working:
using System;
using System.Text;
using System.IO;
using System.Xml.Serialization;
public class Test
{
public int X { get; set; }
static void Main()
{
Test t = new Test();
var serializer = new XmlSerializer(typeof(Test));
string utf8;
using (StringWriter writer = new Utf8StringWriter())
{
serializer.Serialize(writer, t);
utf8 = writer.ToString();
}
Console.WriteLine(utf8);
}
public class Utf8StringWriter : StringWriter
{
public override Encoding Encoding => Encoding.UTF8;
}
}
Result:
<?xml version="1.0" encoding="utf-8"?>
<Test xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<X>0</X>
</Test>
Note the declared encoding of "utf-8" which is what we wanted, I believe.
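The later step mentioned above, converting the string into UTF-8 bytes when it is handed on, might look like the following. This is a sketch with hypothetical names (`SamplePoint`, `Utf8DeclaringStringWriter`, `Utf8BytesDemo`); it also reads the bytes back to show that the declaration and the bytes agree.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Serialization;

public class SamplePoint
{
    public int X { get; set; }
}

public sealed class Utf8DeclaringStringWriter : StringWriter
{
    // Only affects the encoding name written into the XML declaration.
    public override Encoding Encoding { get { return Encoding.UTF8; } }
}

public static class Utf8BytesDemo
{
    public static byte[] ToUtf8Bytes(SamplePoint value)
    {
        var serializer = new XmlSerializer(typeof(SamplePoint));
        using (var writer = new Utf8DeclaringStringWriter())
        {
            serializer.Serialize(writer, value);
            // The string is UTF-16 in memory; this is where it becomes real UTF-8.
            return new UTF8Encoding(false).GetBytes(writer.ToString());
        }
    }

    public static SamplePoint FromUtf8Bytes(byte[] bytes)
    {
        var serializer = new XmlSerializer(typeof(SamplePoint));
        using (var stream = new MemoryStream(bytes))
        {
            return (SamplePoint)serializer.Deserialize(stream);
        }
    }
}
```

Using `new UTF8Encoding(false)` avoids emitting a byte order mark; whether a BOM is wanted depends on the consumer.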
Your code doesn't get the UTF-8 into memory as you read it back into a string again, so it's no longer in UTF-8, but back in UTF-16 (though ideally it's best to consider strings at a higher level than any encoding, except when forced to do so).
To get the actual UTF-8 octets you could use:
var serializer = new XmlSerializer(typeof(SomeSerializableObject));
var memoryStream = new MemoryStream();
var streamWriter = new StreamWriter(memoryStream, System.Text.Encoding.UTF8);
serializer.Serialize(streamWriter, entry);
byte[] utf8EncodedXml = memoryStream.ToArray();
I've left out the same disposal you've left. I slightly favour the following (with normal disposal left in):
var serializer = new XmlSerializer(typeof(SomeSerializableObject));
using(var memStm = new MemoryStream())
using(var xw = XmlWriter.Create(memStm))
{
serializer.Serialize(xw, entry);
var utf8 = memStm.ToArray();
}
Which is much the same amount of complexity, but does show that at every stage there is a reasonable choice to do something else, the most pressing of which is to serialise to somewhere other than to memory, such as to a file, TCP/IP stream, database, etc. All in all, it's not really that verbose.
Very good answer using inheritance; just remember to add a matching constructor if you need to write into an existing StringBuilder:
public class Utf8StringWriter : StringWriter
{
public Utf8StringWriter(StringBuilder sb) : base (sb)
{
}
public override Encoding Encoding { get { return Encoding.UTF8; } }
}
I found this blog post which explains the problem very well, and defines a few different solutions:
(dead link removed)
I've settled for the idea that the best way to do it is to completely omit the XML declaration when in memory. It actually is UTF-16 at that point anyway, but the XML declaration doesn't seem meaningful until it has been written to a file with a particular encoding; and even then the declaration is not required. It doesn't seem to break deserialization, at least.
As @Jon Hanna mentions, this can be done with an XmlWriter created like this:
XmlWriter writer = XmlWriter.Create (output, new XmlWriterSettings() { OmitXmlDeclaration = true });
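Put together with a serializer, omitting the declaration could be sketched as follows. `SampleItem` and `NoDeclarationDemo` are placeholder names, and the output target is assumed to be a StringBuilder.

```csharp
using System;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

public class SampleItem
{
    public string Name { get; set; }
}

public static class NoDeclarationDemo
{
    // Serializes to an in-memory string with no <?xml ...?> line at all,
    // so there is no encoding claim that could later be wrong.
    public static string Serialize(SampleItem item)
    {
        var output = new StringBuilder();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            new XmlSerializer(typeof(SampleItem)).Serialize(writer, item);
        }
        return output.ToString();
    }
}
```

The result starts directly with the root element, which leaves whoever eventually writes the bytes free to choose the encoding.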
