BOM encoding for database storage

I'm using the following code to serialise an object:
public static string Serialise(IMessageSerializer messageSerializer, DelayMessage message)
{
using (var stream = new MemoryStream())
{
messageSerializer.Serialize(new[] { message }, stream);
return Encoding.UTF8.GetString(stream.ToArray());
}
}
Unfortunately, when I save it to a database (using LINQ to SQL), then query the database, the string appears to start with a question mark:
?<z:anyType xmlns...
How do I get rid of that? When I try to de-serialise using the following:
public static DelayMessage Deserialise(IMessageSerializer messageSerializer, string data)
{
using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(data)))
{
return (DelayMessage)messageSerializer.Deserialize(stream)[0];
}
}
I get the following exception:
"Error in line 1 position 1. Expecting
element 'anyType' from namespace
'http://schemas.microsoft.com/2003/10/Serialization/'..
Encountered 'Text' with name '',
namespace ''. "
The implementations of the messageSerializer use the DataContractSerializer as follows:
public void Serialize(IMessage[] messages, Stream stream)
{
var xws = new XmlWriterSettings { ConformanceLevel = ConformanceLevel.Fragment };
using (var xmlWriter = XmlWriter.Create(stream, xws))
{
var dcs = new DataContractSerializer(typeof(IMessage), knownTypes);
foreach (var message in messages)
{
dcs.WriteObject(xmlWriter, message);
}
}
}
public IMessage[] Deserialize(Stream stream)
{
var xrs = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
using (var xmlReader = XmlReader.Create(stream, xrs))
{
var dcs = new DataContractSerializer(typeof(IMessage), knownTypes);
var messages = new List<IMessage>();
while (false == xmlReader.EOF)
{
var message = (IMessage)dcs.ReadObject(xmlReader);
messages.Add(message);
}
return messages.ToArray();
}
}

Unfortunately, when I save it to a database (using LINQ to SQL), then query the database, the string appears to start with a question mark:
?<z:anyType xmlns...
Your database is not set up to support Unicode characters. You write a string including a BOM in it, the database can't store it so mangles it into a '?'. Then when you come back to read the string as XML, the '?' is text content outside the root element and you get an error. (You can only have whitespace text outside the root element.)
Why is the BOM getting there? Because Microsoft love dropping BOMs all over the place, even when they're not needed (and they never are, with UTF-8). The solution is to make your own instance of UTF8Encoding instead of using the built-in Encoding.UTF8, and tell it you don't want its stupid BOMs:
Encoding utf8onlynotasridiculouslysucky = new UTF8Encoding(false);
However, this is only really masking the real issue, which is the database configuration.
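A minimal sketch of where the BOM-free encoding could be plugged in. It assumes (this is not stated above) that the BOM is emitted by XmlWriter.Create, whose default XmlWriterSettings.Encoding is Encoding.UTF8 with a preamble. The names BomFreeXml and Utf8NoBom are hypothetical, and the generic Serialise<T> stands in for the question's IMessageSerializer.

```csharp
using System.IO;
using System.Runtime.Serialization;
using System.Text;
using System.Xml;

public static class BomFreeXml
{
    // UTF8Encoding(false) means "do not emit the EF BB BF preamble".
    private static readonly Encoding Utf8NoBom = new UTF8Encoding(false);

    public static string Serialise<T>(T value)
    {
        var settings = new XmlWriterSettings
        {
            Encoding = Utf8NoBom,                         // the writer uses this when targeting a Stream
            ConformanceLevel = ConformanceLevel.Fragment  // same setting as in the question
        };

        using (var stream = new MemoryStream())
        {
            using (var writer = XmlWriter.Create(stream, settings))
            {
                new DataContractSerializer(typeof(T)).WriteObject(writer, value);
            }   // disposing the writer flushes it into the stream

            return Utf8NoBom.GetString(stream.ToArray());
        }
    }
}
```

With this in place the returned string should begin with '<' rather than U+FEFF, so nothing is left for the database to turn into '?'. Storing the text in a Unicode column is still the more robust fix.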

Related

Invalid OpenXml after converting from XElement

I'm using this code to convert from an XElement to OpenXmlElement
internal static OpenXmlElement ToOpenXml(this XElement xel)
{
using (var sw = new StreamWriter(new MemoryStream()))
{
sw.Write(xel.ToString());
sw.Flush();
sw.BaseStream.Seek(0, SeekOrigin.Begin);
var re = OpenXmlReader.Create(sw.BaseStream);
re.Read();
var oxe = re.LoadCurrentElement();
re.Close();
return oxe;
}
}
Before the conversion I have an XElement
<w:ind w:firstLine="0" w:left="0" w:right="0"/>
After the conversion it looks like this
<w:ind w:firstLine="0" w:end="0" w:start="0"/>
This element then fails OpenXml validation using the following
var v = new OpenXmlValidator();
var errs = v.Validate(doc);
With the errors being reported:
Description="The 'http://schemas.openxmlformats.org/wordprocessingml/2006/main:start' attribute is not declared."
Description="The 'http://schemas.openxmlformats.org/wordprocessingml/2006/main:end' attribute is not declared."
Do I need to do other things to add these attributes to the schema or do I need to find a new way to convert from XElement to OpenXml?
I'm using the nuget package DocumentFormat.OpenXml ver 2.9.1 (the latest).
EDIT: Looking at the OpenXml standard, it seems that both left/start and right/end should be recognised which would point to the OpenXmlValidator not being quite correct. Presumably I can just ignore those validation errors then?
Many thx

The short answer is that you can indeed ignore those specific validation errors. The OpenXmlValidator is not up-to-date in this case.
I would additionally offer a more elegant implementation of your ToOpenXml method (note the using declarations, which were added in C# 8.0).
internal static OpenXmlElement ToOpenXmlElement(this XElement element)
{
// Write XElement to MemoryStream.
using var stream = new MemoryStream();
element.Save(stream);
stream.Seek(0, SeekOrigin.Begin);
// Read OpenXmlElement from MemoryStream.
using OpenXmlReader reader = OpenXmlReader.Create(stream);
reader.Read();
return reader.LoadCurrentElement();
}
If you don't use C# 8.0 or using declarations, here's the corresponding code with using statements.
internal static OpenXmlElement ToOpenXmlElement(this XElement element)
{
using (var stream = new MemoryStream())
{
// Write XElement to MemoryStream.
element.Save(stream);
stream.Seek(0, SeekOrigin.Begin);
// Read OpenXmlElement from MemoryStream.
using (OpenXmlReader reader = OpenXmlReader.Create(stream))
{
reader.Read();
return reader.LoadCurrentElement();
}
}
}
Here's the corresponding unit test, which also demonstrates that you'd have to pass a w:document to have the w:ind element's attributes changed by the Indentation instance created in the process.
public class OpenXmlReaderTests
{
private const string NamespaceUriW = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
private static readonly string XmlnsW = $"xmlns:w=\"{NamespaceUriW}\"";
private static readonly string IndText =
$@"<w:ind {XmlnsW} w:firstLine=""10"" w:left=""20"" w:right=""30""/>";
private static readonly string DocumentText =
$@"<w:document {XmlnsW}><w:body><w:p><w:pPr>{IndText}</w:pPr></w:p></w:body></w:document>";
[Fact]
public void ConvertingDocumentChangesIndProperties()
{
XElement element = XElement.Parse(DocumentText);
var document = (Document) element.ToOpenXmlElement();
Indentation ind = document.Descendants<Indentation>().First();
Assert.Null(ind.Left);
Assert.Null(ind.Right);
Assert.Equal("10", ind.FirstLine);
Assert.Equal("20", ind.Start);
Assert.Equal("30", ind.End);
}
[Fact]
public void ConvertingIndDoesNotChangeIndProperties()
{
XElement element = XElement.Parse(IndText);
var ind = (OpenXmlUnknownElement) element.ToOpenXmlElement();
Assert.Equal("10", ind.GetAttribute("firstLine", NamespaceUriW).Value);
Assert.Equal("20", ind.GetAttribute("left", NamespaceUriW).Value);
Assert.Equal("30", ind.GetAttribute("right", NamespaceUriW).Value);
}
}

How to stop invalid characters in xml with a limited (windows-1251) encoding in c#

This one's puzzling me no end. I need to send an XML message to a Russian web service. The XML has to be encoded in windows-1251.
I have a number of objects that respond to the different types of messages and I turn them into xml thus:
public string Serialise(Type t, object o, XmlSerializerNamespaces Namespaces)
{
XmlSerializer serialiser = _serialisers.First(s => s.GetType().FullName.Contains(t.Name));
Windows1251StringWriter myWriter = new Windows1251StringWriter();
serialiser.Serialize(myWriter, o, Namespaces);
return myWriter.ToString();
}
public class Windows1251StringWriter : StringWriter
{
public override Encoding Encoding
{
get { return Encoding.GetEncoding(1251); }
}
}
which works fine but the web service rejects requests if we send any characters that aren't in windows-1251. In the latest example I tried to send a phone number with 'LEFT-TO-RIGHT EMBEDDING' (U+202A), 'NON-BREAKING HYPHEN' (U+2011) and god help us 'POP DIRECTIONAL FORMATTING' (U+202C). I have no control over the input. I'd like to turn any unknown characters into ? or remove them. I've tried messing with the EncoderFallback but it doesn't seem to change anything.
Am I going about this wrong?

Since you are serializing to a string, the only thing the Encoding property in Windows1251StringWriter does for you is to change the name of the encoding shown in the XML:
<?xml version="1.0" encoding="windows-1251"?>
(I think this trick comes from here.)
And that's it. All C# strings are always encoded in UTF-16, and the base class StringWriter writes in this encoding no matter what, regardless of whether the Encoding property is overridden.
To strip away characters from your XML that are invalid in some specific encoding, you need to encode it down to a byte stream, then decode it, e.g. with the following:
public static class XmlSerializationHelper
{
public static string GetXml<X>(this X toSerialize, XmlSerializer serializer = null, XmlSerializerNamespaces namespaces = null, Encoding encoding = null)
{
if (toSerialize == null)
throw new ArgumentNullException();
encoding = encoding ?? Encoding.UTF8;
serializer = serializer ?? new XmlSerializer(toSerialize.GetType());
using (var stream = new MemoryStream())
using (var writer = new StreamWriter(stream, encoding))
{
serializer.Serialize(writer, toSerialize, namespaces);
writer.Flush();
stream.Position = 0;
using (var reader = new StreamReader(stream, encoding))
{
return reader.ReadToEnd();
}
}
}
}
Then do
var encoding = Encoding.GetEncoding(1251, new EncoderReplacementFallback(""), new DecoderExceptionFallback());
return o.GetXml(serialiser, Namespaces, encoding);
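The encode-then-decode round trip described above can be tried on a plain string first. This is only a sketch: Windows1251Sanitiser and Sanitise are hypothetical names, and the RegisterProvider call assumes .NET Core 3.0 or later (or .NET 5+), where code page 1251 is not available until the code pages provider is registered. On the full .NET Framework that line would not be needed.

```csharp
using System.Text;

public static class Windows1251Sanitiser
{
    // Replaces (or, with an empty replacement, removes) every character
    // that cannot be represented in windows-1251.
    public static string Sanitise(string text, string replacement = "?")
    {
        // Assumption: running on .NET Core 3.0+ / .NET 5+.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding encoding = Encoding.GetEncoding(
            1251,
            new EncoderReplacementFallback(replacement),
            new DecoderReplacementFallback(replacement));

        // Encode down to windows-1251 bytes, then decode back to a string.
        return encoding.GetString(encoding.GetBytes(text));
    }
}
```

For the phone number case in the question, Sanitise("+7\u202A495\u2011123\u202C", "") gives "+7495123", while Cyrillic text passes through unchanged because it is representable in windows-1251.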

Well Formed XML using Service Stack

I'm building an MVC5 application which pulls records from a database and allows a user to perform some basic data cleansing edits.
Once the data has been cleansed it needs to be exported as XML, run through a validator and then uploaded to a third party portal.
I'm using Service Stack, and I've found it fairly quick and straightforward in the past, particularly when outputting to CSV.
The one issue I'm having is with the XML serializer. I'm not sure how to make it generate well-formed XML.
The file that I'm getting simply dumps everything on one line, which won't validate because it isn't well formed.
Below is an extract from my controller action:
Response.Clear();
Response.ContentType = "text/xml";
Response.AddHeader("Content-Disposition", "attachment; filename=\"myFile.xml\"");
XmlSerializer.SerializeToStream(viewModel, Response.OutputStream);
Response.End();
UPDATE: Thanks for the useful comments, as explained I'm not talking about pretty printing, the issue is I need to run the file through a validator before uploading it to a third party. The error message the validator is throwing is Error:0000, XML not well-formed. Cannot have more than one tag on one line.

Firstly, be aware that most white space (including new lines) in XML is insignificant -- it has no meaning, and is only for beautification. The lack of new lines doesn't make the XML ill-formed. See White Space in XML Documents or https://www.w3.org/TR/REC-xml/#sec-white-space. Thus in theory it shouldn't matter whether ServiceStack's XmlSerializer is putting all of your XML on a single line.
That being said, if for whatever reason you must cosmetically break your XML up into multiple lines, you'll need to do a little work. From the source code we can see that XmlSerializer uses DataContractSerializer with a hardcoded static XmlWriterSettings that does not allow for setting XmlWriterSettings.Indent = true. However, since this class is just a very thin wrapper on Microsoft's data contract serializer, you can substitute your own code:
public static class DataContractSerializerHelper
{
private static readonly XmlWriterSettings xmlWriterSettings = new XmlWriterSettings { Indent = true, IndentChars = " " };
public static string SerializeToString<T>(T from)
{
try
{
using (var ms = new MemoryStream())
using (var xw = XmlWriter.Create(ms, xmlWriterSettings))
{
var serializer = new DataContractSerializer(from.GetType());
serializer.WriteObject(xw, from);
xw.Flush();
ms.Seek(0, SeekOrigin.Begin);
var reader = new StreamReader(ms);
return reader.ReadToEnd();
}
}
catch (Exception ex)
{
throw new SerializationException(string.Format("Error serializing \"{0}\"", from), ex);
}
}
public static void SerializeToWriter<T>(T value, TextWriter writer)
{
try
{
using (var xw = XmlWriter.Create(writer, xmlWriterSettings))
{
var serializer = new DataContractSerializer(value.GetType());
serializer.WriteObject(xw, value);
}
}
catch (Exception ex)
{
throw new SerializationException(string.Format("Error serializing \"{0}\"", value), ex);
}
}
public static void SerializeToStream(object obj, Stream stream)
{
if (obj == null)
return;
using (var xw = XmlWriter.Create(stream, xmlWriterSettings))
{
var serializer = new DataContractSerializer(obj.GetType());
serializer.WriteObject(xw, obj);
}
}
}
And then do:
DataContractSerializerHelper.SerializeToStream(viewModel, Response.OutputStream);
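To check the earlier point that line breaks between elements do not affect well-formedness, one option is to parse a single-line and an indented version of the same XML and compare them. A small sketch using LINQ to XML; WhitespaceCheck and SameDocument are hypothetical names.

```csharp
using System.Xml.Linq;

public static class WhitespaceCheck
{
    // Returns true when both strings parse (so both are well formed)
    // and describe the same tree once insignificant whitespace is ignored.
    public static bool SameDocument(string xmlA, string xmlB)
    {
        // LoadOptions.None discards insignificant whitespace while parsing.
        XDocument a = XDocument.Parse(xmlA, LoadOptions.None);
        XDocument b = XDocument.Parse(xmlB, LoadOptions.None);
        return XNode.DeepEquals(a, b);
    }
}
```

SameDocument("<a><b>1</b><c>2</c></a>", "<a>\n  <b>1</b>\n  <c>2</c>\n</a>") returns true, which suggests the third-party validator is applying a formatting rule of its own rather than the XML definition of well-formedness.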

Avoid XML Escape Double Quote

I'm currently trying to serialize a class into XML to be posted to php web service.
Whenever I do the normal serialization using XmlSerializer, the XML declaration always appears on the first line of the XML document (something like <?xml ....?>). I tested the XML and was unable to get it working because the endpoint does not accept the XML declaration and I can't do anything about that.
I'm unfamiliar with XML serialization in C#, to be honest.
Therefore, I used XmlWriter to do this as below:
private string SerializeClassToString(GetRiskReport value)
{
var emptyNS = new XmlSerializerNamespaces(new[] { XmlQualifiedName.Empty });
var ser = new XmlSerializer(value.GetType());
var settings = new XmlWriterSettings();
settings.OmitXmlDeclaration = true;
using (var stream = new StringWriter())
{
using (var writer = XmlWriter.Create(stream, settings))
{
ser.Serialize(writer, value, emptyNS);
return stream.ToString();
}
}
}
Result for the Namespace is
<GetRiskReport FCRA=\"false\" ReturnResultsOnly=\"false\" Monitoring=\"false\">
... and I'm able to omit the XML declaration; however, this has introduced two new problems.
I get \r\n for new lines and escaped double quotes such as ReturnResultsOnly=\"false\" Monitoring=\"false\", which the endpoint is also unable to process.
Does anyone have an idea how to change the XmlWriterSettings to omit the XML declaration, avoid \r\n and also avoid the escaped double quotes \"?
Thanks for your advice in advance.
Simon

Try with following settings
settings.NewLineHandling = NewLineHandling.None;
settings.CheckCharacters = false;
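A sketch of those settings in context, so the resulting string can be inspected directly. RiskReportSketch is a hypothetical stand-in for the question's GetRiskReport class, and CompactXml is a made-up helper name. One assumption worth checking: if the output was viewed in the Visual Studio debugger, the \" and \r\n sequences are only how the debugger displays a string literal, and the string itself contains plain quotes.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical stand-in for GetRiskReport; XmlSerializer needs a public type.
public class RiskReportSketch
{
    [XmlAttribute] public bool FCRA { get; set; }
    [XmlAttribute] public bool Monitoring { get; set; }
}

public static class CompactXml
{
    public static string Serialise(object value)
    {
        var emptyNs = new XmlSerializerNamespaces(new[] { XmlQualifiedName.Empty });
        var settings = new XmlWriterSettings
        {
            OmitXmlDeclaration = true,               // no <?xml ...?> line
            Indent = false,                          // no line breaks between elements
            NewLineHandling = NewLineHandling.None,  // leave new lines in content untouched
            CheckCharacters = false
        };

        using (var text = new StringWriter())
        {
            using (var writer = XmlWriter.Create(text, settings))
            {
                new XmlSerializer(value.GetType()).Serialize(writer, value, emptyNs);
            }   // writer is flushed here, before the string is read

            return text.ToString();
        }
    }
}
```

Writing the result with Console.WriteLine (or to a file) rather than hovering over it in the debugger shows whether any backslashes are really present.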

private void SerializeClassToString(GetRiskReport value)
{
var emptyNS = new XmlSerializerNamespaces(new[]{XmlQualifiedName.Empty});
var ser = new XmlSerializer(value.GetType());
var settings = new XmlWriterSettings();
settings.OmitXmlDeclaration = true;
string path = "your_file_path_here";
if (File.Exists(path)) File.Delete(path);
// Dispose the FileStream as well so the file is flushed and closed.
using (FileStream stream = File.Create(path))
using (var writer = XmlWriter.Create(stream, settings))
{
ser.Serialize(writer, value, emptyNS);
return;
}
}
There was no way to avoid the MS bug, or their intentional specification, in XML serialization. It's easier and faster to use a FileStream object.

Serializing an object as UTF-8 XML in .NET

Proper object disposal removed for brevity, but I'd be shocked if this is the simplest way to encode an object as UTF-8 in memory. There has to be an easier way, doesn't there?
var serializer = new XmlSerializer(typeof(SomeSerializableObject));
var memoryStream = new MemoryStream();
var streamWriter = new StreamWriter(memoryStream, System.Text.Encoding.UTF8);
serializer.Serialize(streamWriter, entry);
memoryStream.Seek(0, SeekOrigin.Begin);
var streamReader = new StreamReader(memoryStream, System.Text.Encoding.UTF8);
var utf8EncodedXml = streamReader.ReadToEnd();

No, you can use a StringWriter to get rid of the intermediate MemoryStream. However, to make the XML declaration say UTF-8 you need to use a StringWriter which overrides the Encoding property:
public class Utf8StringWriter : StringWriter
{
public override Encoding Encoding => Encoding.UTF8;
}
Or if you're not using C# 6 yet:
public class Utf8StringWriter : StringWriter
{
public override Encoding Encoding { get { return Encoding.UTF8; } }
}
Then:
var serializer = new XmlSerializer(typeof(SomeSerializableObject));
string utf8;
using (StringWriter writer = new Utf8StringWriter())
{
serializer.Serialize(writer, entry);
utf8 = writer.ToString();
}
Obviously you can make Utf8StringWriter into a more general class which accepts any encoding in its constructor - but in my experience UTF-8 is by far the most commonly required "custom" encoding for a StringWriter :)
Now as Jon Hanna says, this will still be UTF-16 internally, but presumably you're going to pass it to something else at some point, to convert it into binary data... at that point you can use the above string, convert it into UTF-8 bytes, and all will be well - because the XML declaration will specify "utf-8" as the encoding.
EDIT: A short but complete example to show this working:
using System;
using System.Text;
using System.IO;
using System.Xml.Serialization;
public class Test
{
public int X { get; set; }
static void Main()
{
Test t = new Test();
var serializer = new XmlSerializer(typeof(Test));
string utf8;
using (StringWriter writer = new Utf8StringWriter())
{
serializer.Serialize(writer, t);
utf8 = writer.ToString();
}
Console.WriteLine(utf8);
}
public class Utf8StringWriter : StringWriter
{
public override Encoding Encoding => Encoding.UTF8;
}
}
Result:
<?xml version="1.0" encoding="utf-8"?>
<Test xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<X>0</X>
</Test>
Note the declared encoding of "utf-8" which is what we wanted, I believe.

Your code doesn't get the UTF-8 into memory, as you read it back into a string again, so it's no longer in UTF-8 but back in UTF-16 (though ideally it's best to consider strings at a higher level than any encoding, except when forced to do so).
To get the actual UTF-8 octets you could use:
var serializer = new XmlSerializer(typeof(SomeSerializableObject));
var memoryStream = new MemoryStream();
var streamWriter = new StreamWriter(memoryStream, System.Text.Encoding.UTF8);
serializer.Serialize(streamWriter, entry);
byte[] utf8EncodedXml = memoryStream.ToArray();
I've left out the same disposal you've left. I slightly favour the following (with normal disposal left in):
var serializer = new XmlSerializer(typeof(SomeSerializableObject));
using(var memStm = new MemoryStream())
using(var xw = XmlWriter.Create(memStm))
{
serializer.Serialize(xw, entry);
var utf8 = memStm.ToArray();
}
Which is much the same amount of complexity, but does show that at every stage there is a reasonable choice to do something else, the most pressing of which is to serialise to somewhere other than to memory, such as to a file, TCP/IP stream, database, etc. All in all, it's not really that verbose.

Very good answer using inheritance. Just remember to add a matching constructor if you need to pass in a StringBuilder:
public class Utf8StringWriter : StringWriter
{
public Utf8StringWriter(StringBuilder sb) : base (sb)
{
}
public override Encoding Encoding { get { return Encoding.UTF8; } }
}

I found this blog post which explains the problem very well, and defines a few different solutions:
(dead link removed)
I've settled for the idea that the best way to do it is to completely omit the XML declaration when in memory. It actually is UTF-16 at that point anyway, but the XML declaration doesn't seem meaningful until it has been written to a file with a particular encoding; and even then the declaration is not required. It doesn't seem to break deserialization, at least.
As @Jon Hanna mentions, this can be done with an XmlWriter created like this:
XmlWriter writer = XmlWriter.Create (output, new XmlWriterSettings() { OmitXmlDeclaration = true });
