Invalid OpenXml after converting from XElement - c#

I'm using this code to convert from an XElement to an OpenXmlElement:
internal static OpenXmlElement ToOpenXml(this XElement xel)
{
using (var sw = new StreamWriter(new MemoryStream()))
{
sw.Write(xel.ToString());
sw.Flush();
sw.BaseStream.Seek(0, SeekOrigin.Begin);
var re = OpenXmlReader.Create(sw.BaseStream);
re.Read();
var oxe = re.LoadCurrentElement();
re.Close();
return oxe;
}
}
Before the conversion I have an XElement
<w:ind w:firstLine="0" w:left="0" w:right="0"/>
After the conversion it looks like this
<w:ind w:firstLine="0" w:end="0" w:start="0"/>
This element then fails OpenXml validation using the following
var v = new OpenXmlValidator();
var errs = v.Validate(doc);
With the errors being reported:
Description="The 'http://schemas.openxmlformats.org/wordprocessingml/2006/main:start' attribute is not declared."
Description="The 'http://schemas.openxmlformats.org/wordprocessingml/2006/main:end' attribute is not declared."
Do I need to do other things to add these attributes to the schema or do I need to find a new way to convert from XElement to OpenXml?
I'm using the nuget package DocumentFormat.OpenXml ver 2.9.1 (the latest).
EDIT: Looking at the OpenXml standard, it seems that both left/start and right/end should be recognised which would point to the OpenXmlValidator not being quite correct. Presumably I can just ignore those validation errors then?
Many thx

The short answer is that you can indeed ignore those specific validation errors. The OpenXmlValidator is not up-to-date in this case.
I would additionally offer a more elegant implementation of your ToOpenXml method (note the using declarations, which were added in C# 8.0).
internal static OpenXmlElement ToOpenXmlElement(this XElement element)
{
// Write XElement to MemoryStream.
using var stream = new MemoryStream();
element.Save(stream);
stream.Seek(0, SeekOrigin.Begin);
// Read OpenXmlElement from MemoryStream.
using OpenXmlReader reader = OpenXmlReader.Create(stream);
reader.Read();
return reader.LoadCurrentElement();
}
If you don't use C# 8.0 or using declarations, here's the corresponding code with using statements.
internal static OpenXmlElement ToOpenXmlElement(this XElement element)
{
using (var stream = new MemoryStream())
{
// Write XElement to MemoryStream.
element.Save(stream);
stream.Seek(0, SeekOrigin.Begin);
// Read OpenXmlElement from MemoryStream.
using (OpenXmlReader reader = OpenXmlReader.Create(stream))
{
reader.Read();
return reader.LoadCurrentElement();
}
}
}
Here's the corresponding unit test, which also demonstrates that you'd have to pass a w:document to have the w:ind element's attributes changed by the Indentation instance created in the process.
public class OpenXmlReaderTests
{
private const string NamespaceUriW = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
private static readonly string XmlnsW = $"xmlns:w=\"{NamespaceUriW}\"";
private static readonly string IndText =
$@"<w:ind {XmlnsW} w:firstLine=""10"" w:left=""20"" w:right=""30""/>";
private static readonly string DocumentText =
$@"<w:document {XmlnsW}><w:body><w:p><w:pPr>{IndText}</w:pPr></w:p></w:body></w:document>";
[Fact]
public void ConvertingDocumentChangesIndProperties()
{
XElement element = XElement.Parse(DocumentText);
var document = (Document) element.ToOpenXmlElement();
Indentation ind = document.Descendants<Indentation>().First();
Assert.Null(ind.Left);
Assert.Null(ind.Right);
Assert.Equal("10", ind.FirstLine);
Assert.Equal("20", ind.Start);
Assert.Equal("30", ind.End);
}
[Fact]
public void ConvertingIndDoesNotChangeIndProperties()
{
XElement element = XElement.Parse(IndText);
var ind = (OpenXmlUnknownElement) element.ToOpenXmlElement();
Assert.Equal("10", ind.GetAttribute("firstLine", NamespaceUriW).Value);
Assert.Equal("20", ind.GetAttribute("left", NamespaceUriW).Value);
Assert.Equal("30", ind.GetAttribute("right", NamespaceUriW).Value);
}
}

Related

Converting xml to stream and disabling dtd breaking Deserialization

I have created the following wrapper method to disable DTD
public class Program
{
public static void Main(string[] args)
{
string s = @"<?xml version =""1.0"" encoding=""utf-16""?>
<ArrayOfSerializingTemplateItem xmlns:xsd=""http://www.w3.org/2001/XMLSchema"" xmlns:xsi=""http://www.w3.org/2001/XMLSchema-instance"">
<SerializingTemplateItem>
</SerializingTemplateItem>
</ArrayOfSerializingTemplateItem >";
try
{
XmlReader reader = XmlWrapper.CreateXmlReaderObject(s);
XmlSerializer sr = new XmlSerializer(typeof(List<SerializingTemplateItem>));
Object ob = sr.Deserialize(reader);
}
catch (Exception ex)
{
Console.WriteLine(ex);
throw;
}
Console.ReadLine();
}
}
public class XmlWrapper
{
public static XmlReader CreateXmlReaderObject(string sr)
{
byte[] byteArray = Encoding.UTF8.GetBytes(sr);
MemoryStream stream = new MemoryStream(byteArray);
stream.Position = 0;
XmlReaderSettings settings = new XmlReaderSettings();
settings.ValidationType = ValidationType.None;
settings.DtdProcessing = DtdProcessing.Ignore;
return XmlReader.Create(stream, settings);
}
}
public class SerializingTemplateItem
{
}
The above throws exception "There is no Unicode byte order mark. Cannot switch to Unicode." (Demo fiddle here: https://dotnetfiddle.net/pGxOE9).
But if I use the following code to create the XmlReader instead of calling the XmlWrapper method, it works fine.
StringReader stringReader = new StringReader( xml );
XmlReader reader = new XmlTextReader( stringReader );
But I need to use the wrapper method as a security requirement to disable DTD. I don't know why I am unable to deserialize after calling my wrapper method. Any help will be highly appreciated.
Your problem is that you have encoded the XML into a MemoryStream using Encoding.UTF8, but the XML string itself claims to be encoded in UTF-16 in the encoding declaration in its XML text declaration:
<?xml version ="1.0" encoding="utf-16"?>
<ArrayOfSerializingTemplateItem>
<!-- Content omitted -->
</ArrayOfSerializingTemplateItem >
Apparently, when the XmlReader encounters this declaration, it tries to honor it and switch from UTF-8 to UTF-16, but fails, possibly because the stream really is encoded in UTF-8. Conversely, when the deprecated XmlTextReader encounters the declaration, it apparently just ignores it as not implemented, which happens to make things work in this situation.
The simplest way to resolve this is to read directly from the string using a StringReader using XmlReader.Create(TextReader, XmlReaderSettings):
public class XmlWrapper
{
public static XmlReader CreateXmlReaderObject(string sr)
{
var settings = new XmlReaderSettings
{
ValidationType = ValidationType.None,
DtdProcessing = DtdProcessing.Ignore,
};
return XmlReader.Create(new StringReader(sr), settings);
}
}
Since a C# string is always encoded internally in UTF-16, the encoding declaration in the XML will be ignored as irrelevant. This will also be more performant, as the conversion to an intermediate byte array is skipped completely.
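To compare the two approaches side by side, here is a minimal sketch. The class and method names (EncodingDeclarationSketch, RootNameFromUtf8Bytes, RootNameFromString) are made up for illustration, and the XML is reduced to a single hypothetical Root element. The stream-based reader is expected to fail in the way described in the question, while the StringReader-based one should read the root element.

```csharp
using System.IO;
using System.Text;
using System.Xml;

public static class EncodingDeclarationSketch
{
    // The declaration claims UTF-16, exactly as in the question.
    public const string Xml = "<?xml version=\"1.0\" encoding=\"utf-16\"?><Root />";

    private static XmlReaderSettings CreateSettings()
    {
        return new XmlReaderSettings
        {
            ValidationType = ValidationType.None,
            DtdProcessing = DtdProcessing.Ignore,
        };
    }

    // Moves to the root element and returns its name,
    // or a "failed: ..." string if the reader throws an XmlException.
    private static string ReadRootName(XmlReader reader)
    {
        try
        {
            using (reader)
            {
                reader.MoveToContent();
                return reader.LocalName;
            }
        }
        catch (XmlException ex)
        {
            return "failed: " + ex.Message;
        }
    }

    // UTF-8 bytes contradict the declared encoding, so the reader has to
    // reconcile the byte stream with the "utf-16" declaration.
    public static string RootNameFromUtf8Bytes()
    {
        var stream = new MemoryStream(Encoding.UTF8.GetBytes(Xml));
        return ReadRootName(XmlReader.Create(stream, CreateSettings()));
    }

    // The text is already decoded, so no byte-level encoding switch is needed.
    public static string RootNameFromString()
    {
        return ReadRootName(XmlReader.Create(new StringReader(Xml), CreateSettings()));
    }
}
```

Calling EncodingDeclarationSketch.RootNameFromString() should return "Root"; RootNameFromUtf8Bytes() should return a string starting with "failed:".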
Incidentally, you should dispose of your XmlReader via a using statement:
Object ob;
using (var reader = XmlWrapper.CreateXmlReaderObject(s))
{
XmlSerializer sr = new XmlSerializer(typeof(List<SerializingTemplateItem>));
ob = sr.Deserialize(reader);
}
Working sample fiddle here.
Related questions:
Meaning of - <?xml version="1.0" encoding="utf-8"?>
Ignoring specified encoding when deserializing XML

Problem with XElement and XslCompiledTransform

I'm having some trouble using a combination of XElement and XslCompiledTransform. I've put the sample code I'm using below. If I get my input XML using the GetXmlDocumentXml() method, it works fine. If I use the GetXElementXml() method instead, I get an InvalidOperationException when calling the Transform method of XslCompiledTransform:
Token Text in state Start would result in an invalid XML document. Make sure that the ConformanceLevel setting is set to ConformanceLevel.Fragment or ConformanceLevel.Auto if you want to write an XML fragment.
The CreateNavigator method on both XElement and XmlDocument returns an XPathNavigator. What extra stuff is XmlDocument doing so this all works, and how can I do the same with XElement? Am I just doing something insane?
static void Main(string[] args)
{
XslCompiledTransform stylesheet = GetStylesheet(); // not shown for brevity
IXPathNavigable input = GetXElementXml();
using (MemoryStream ms = TransformXml(input, stylesheet))
{
XmlReader xr = XmlReader.Create(ms);
xr.MoveToContent();
}
}
private static MemoryStream TransformXml(
IXPathNavigable xml,
XslCompiledTransform stylesheet)
{
MemoryStream transformed = new MemoryStream();
XmlWriter writer = XmlWriter.Create(transformed);
stylesheet.Transform(xml, null, writer);
transformed.Position = 0;
return transformed;
}
private static IXPathNavigable GetXElementXml()
{
var xml = new XElement("x", new XElement("y", "sds"));
return xml.CreateNavigator();
}
private static IXPathNavigable GetXmlDocumentXml()
{
var xml = new XmlDocument();
xml.LoadXml("<x><y>sds</y></x>");
return xml.CreateNavigator();
}
Oh, that was easy. The solution was to wrap the XElement in an XDocument object. Problem solved!
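For anyone landing here, a minimal sketch of that fix follows. The class name and the identity stylesheet are assumptions made for illustration (the original stylesheet was not shown); the relevant part is that the navigator is created from an XDocument wrapping the XElement rather than from the XElement itself.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Linq;
using System.Xml.XPath;
using System.Xml.Xsl;

public static class XDocumentTransformSketch
{
    // Hypothetical identity stylesheet, standing in for the one not shown in the question.
    private const string IdentityXslt =
        @"<xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
  <xsl:template match=""@*|node()"">
    <xsl:copy><xsl:apply-templates select=""@*|node()""/></xsl:copy>
  </xsl:template>
</xsl:stylesheet>";

    public static string Transform()
    {
        var stylesheet = new XslCompiledTransform();
        using (XmlReader xslt = XmlReader.Create(new StringReader(IdentityXslt)))
        {
            stylesheet.Load(xslt);
        }

        var element = new XElement("x", new XElement("y", "sds"));

        // Wrap the element in an XDocument so the navigator starts at a document node.
        var document = new XDocument(element);
        XPathNavigator input = document.CreateNavigator();

        var output = new StringWriter();
        using (XmlWriter writer = XmlWriter.Create(output, stylesheet.OutputSettings))
        {
            stylesheet.Transform(input, null, writer);
        }
        return output.ToString();
    }
}
```

The returned string should contain the transformed x element with its y child.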

How can I read information from an XML file and set them to attributes in a class?

I have this code:
public class Hero
{
XmlReader Reader = new XmlTextReader("InformationRepositories/HeroRepository/HeroInformation.xml");
XmlReaderSettings XMLSettings = new XmlReaderSettings();
public ImageSource GetHeroIcon(string Name)
{
XMLSettings.IgnoreWhitespace = true;
XMLSettings.IgnoreComments = true;
Reader.MoveToAttribute(" //I'm pretty much stuck here.
}
}
And this is the XML file I want to read from:
<?xml version="1.0" encoding="utf-8" ?>
<Hero>
<Legion>
<Andromeda>
<HeroType>Agility</HeroType>
<Damage>39-53</Damage>
<Armor>3.1</Armor>
<MoveSpeed>295</MoveSpeed>
<AttackType>Ranged(400)</AttackType>
<AttackRate>.75</AttackRate>
<Strength>16</Strength>
<Agility>27</Agility>
<Intelligence>15</Intelligence>
<Icon>Images/Hero/Andromeda.gif</Icon>
</Andromeda>
</Legion>
<Hellbourne>
</Hellbourne>
</Hero>
I'm trying to get the <Icon> element.
MoveToAttribute() won't help you, because everything in your XML is elements. The Icon element is a subelement of the Andromeda element.
One of the easiest ways of navigating an XML document if you're using the pre-3.5 xml handling is by using an XPathNavigator. See this example for getting started, but basically you just need to create it and call MoveToChild() or MoveToFollowing() and it'll get you to where you want to be in the document.
XmlDocument doc = new XmlDocument();
doc.Load("InformationRepositories/HeroRepository/HeroInformation.xml");
XPathNavigator nav = doc.CreateNavigator();
if (nav.MoveToFollowing("Icon", ""))
Response.Write(nav.Value); // the icon path is text, so read it as a string rather than an int
Note that MoveToFollowing() only moves forward in document order, so you will have to reposition the navigator (for example with MoveToRoot()) if you need to loop or seek back through the document.
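If the goal is to look up the icon for a hero by name, an XPath query on the XmlDocument is another option. The sketch below assumes the hero name is also the element name, as in the XML from the question; HeroIconLookup is a made-up helper class, and GetHeroIcon here returns the path as a string rather than an ImageSource.

```csharp
using System.Xml;

public static class HeroIconLookup
{
    // Returns the text of /Hero/<faction>/<heroName>/Icon, or null if no such element exists.
    public static string GetHeroIcon(XmlDocument doc, string heroName)
    {
        // Reject anything that is not a plain XML name before building the XPath,
        // so the name cannot change the meaning of the query.
        XmlConvert.VerifyNCName(heroName);

        XmlNode icon = doc.SelectSingleNode("/Hero/*/" + heroName + "/Icon");
        return icon == null ? null : icon.InnerText;
    }
}
```

With the file from the question loaded via doc.Load(...), GetHeroIcon(doc, "Andromeda") should return the Images/Hero/Andromeda.gif path.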
If you're just reading XML to put the values into objects, you should seriously consider doing this automatically via object serialization to XML. This would give you a painless and automatic way to load your xml files back into objects.
Mark your attributes in your object according to the element you want to load them to:
See: http://msdn.microsoft.com/en-us/library/system.xml.serialization.xmlattributeattribute.aspx
If, for some reason, you can't do this to your current object, consider making a bridge object which mirrors your original object and add an AsOriginal() method which returns the original object.
Working off the msdn example:
public class GroupBridge
{
[XmlAttribute (Namespace = "http://www.cpandl.com")]
public string GroupName;
[XmlAttribute(DataType = "base64Binary")]
public Byte [] GroupNumber;
[XmlAttribute(DataType = "date", AttributeName = "CreationDate")]
public DateTime Today;
public Group AsOriginal()
{
Group g = new Group();
g.GroupName = this.GroupName;
g.GroupNumber = this.GroupNumber;
g.Today = this.Today;
return g;
}
}
public class Group
{
public string GroupName;
public Byte [] GroupNumber;
public DateTime Today;
}
To Serialize and DeSerialize from LINQ objects, you can use:
public static string SerializeLINQtoXML<T>(T linqObject)
{
// see http://msdn.microsoft.com/en-us/library/bb546184.aspx
DataContractSerializer dcs = new DataContractSerializer(linqObject.GetType());
StringBuilder sb = new StringBuilder();
XmlWriter writer = XmlWriter.Create(sb);
dcs.WriteObject(writer, linqObject);
writer.Close();
return sb.ToString();
}
public static T DeserializeLINQfromXML<T>(string input)
{
DataContractSerializer dcs = new DataContractSerializer(typeof(T));
TextReader treader = new StringReader(input);
XmlReader reader = XmlReader.Create(treader);
T linqObject = (T)dcs.ReadObject(reader, true);
reader.Close();
return linqObject;
}
I don't have any example code of Serialization from non-LINQ objects, but the MSDN link should point you in the right direction.
You can use linq to xml:
public class XmlTest
{
private XDocument _doc;
public XmlTest(string xml)
{
_doc = XDocument.Load(new StringReader(xml));
}
public string Icon { get { return GetValue("Icon"); } }
private string GetValue(string elementName)
{
return _doc.Descendants(elementName).FirstOrDefault().Value;
}
}
You can use the regular expression "<Icon>.*</Icon>" to find all the icons,
then just remove the tags and use what is left.
That would be a lot shorter:
Regex rgx = new Regex("<Icon>.*</Icon>");
MatchCollection matches = rgx.Matches(xml);
foreach (Match match in matches)
{
string s = match.Value;
s = s.Remove(0, 6); // strip the leading "<Icon>"
s = s.Remove(s.LastIndexOf("</Icon>"), 7); // strip the trailing "</Icon>"
Console.WriteLine(s);
}

BOM encoding for database storage

I'm using the following code to serialise an object:
public static string Serialise(IMessageSerializer messageSerializer, DelayMessage message)
{
using (var stream = new MemoryStream())
{
messageSerializer.Serialize(new[] { message }, stream);
return Encoding.UTF8.GetString(stream.ToArray());
}
}
Unfortunately, when I save it to a database (using LINQ to SQL), then query the database, the string appears to start with a question mark:
?<z:anyType xmlns...
How do I get rid of that? When I try to de-serialise using the following:
public static DelayMessage Deserialise(IMessageSerializer messageSerializer, string data)
{
using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(data)))
{
return (DelayMessage)messageSerializer.Deserialize(stream)[0];
}
}
I get the following exception:
"Error in line 1 position 1. Expecting
element 'anyType' from namespace
'http://schemas.microsoft.com/2003/10/Serialization/'..
Encountered 'Text' with name '',
namespace ''. "
The implementations of the messageSerializer use the DataContractSerializer as follows:
public void Serialize(IMessage[] messages, Stream stream)
{
var xws = new XmlWriterSettings { ConformanceLevel = ConformanceLevel.Fragment };
using (var xmlWriter = XmlWriter.Create(stream, xws))
{
var dcs = new DataContractSerializer(typeof(IMessage), knownTypes);
foreach (var message in messages)
{
dcs.WriteObject(xmlWriter, message);
}
}
}
public IMessage[] Deserialize(Stream stream)
{
var xrs = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
using (var xmlReader = XmlReader.Create(stream, xrs))
{
var dcs = new DataContractSerializer(typeof(IMessage), knownTypes);
var messages = new List<IMessage>();
while (false == xmlReader.EOF)
{
var message = (IMessage)dcs.ReadObject(xmlReader);
messages.Add(message);
}
return messages.ToArray();
}
}
Unfortunately, when I save it to a database (using LINQ to SQL), then query the database, the string appears to start with a question mark:
?<z:anyType xmlns...
Your database is not set up to support Unicode characters. You write a string including a BOM in it, the database can't store it so mangles it into a '?'. Then when you come back to read the string as XML, the '?' is text content outside the root element and you get an error. (You can only have whitespace text outside the root element.)
Why is the BOM getting there? Because Microsoft love dropping BOMs all over the place, even when they're not needed (and they never are, with UTF-8). The solution is to make your own instance of UTF8Encoding instead of using the built-in Encoding.UTF8, and tell it you don't want its stupid BOMs:
Encoding utf8onlynotasridiculouslysucky= new UTF8Encoding(false);
However, this is only really masking the real issue, which is the database configuration.
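As a sketch of where such an encoding instance could be plugged in: assuming the BOM is emitted by the XmlWriter created in Serialize (which defaults to UTF-8 with a BOM when writing to a stream), the encoding can be set on XmlWriterSettings. BomFreeXml and WriteFragment are hypothetical names, and the message element is a stand-in for the DataContractSerializer output.

```csharp
using System.IO;
using System.Text;
using System.Xml;

public static class BomFreeXml
{
    // Writes a small XML fragment to a MemoryStream and returns the raw bytes.
    // emitBom controls whether the UTF-8 encoding writes its byte order mark.
    public static byte[] WriteFragment(bool emitBom)
    {
        var settings = new XmlWriterSettings
        {
            ConformanceLevel = ConformanceLevel.Fragment,
            Encoding = new UTF8Encoding(emitBom),
        };

        using (var stream = new MemoryStream())
        {
            using (XmlWriter writer = XmlWriter.Create(stream, settings))
            {
                writer.WriteElementString("message", "hello");
            }
            // ToArray is still valid after the writer has closed the stream.
            return stream.ToArray();
        }
    }
}
```

With emitBom set to false, the first byte should be the '<' of the element; with true, the output should start with the three BOM bytes EF BB BF.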

streaming XML serialization in .net

I'm trying to serialize a very large IEnumerable<MyObject> using an XmlSerializer without keeping all the objects in memory.
The IEnumerable<MyObject> is actually lazy..
I'm looking for a streaming solution that will:
Take an object from the IEnumerable<MyObject>
Serialize it to the underlying stream using the standard serialization (I don't want to handcraft the XML here!)
Discard the in memory data and move to the next
I'm trying with this code:
using (var writer = new StreamWriter(filePath))
{
var xmlSerializer = new XmlSerializer(typeof(MyObject));
foreach (var myObject in myObjectsIEnumerable)
{
xmlSerializer.Serialize(writer, myObject);
}
}
but I'm getting multiple XML headers and I cannot specify a root tag <MyObjects> so my XML is invalid.
Any idea?
Thanks
The XmlWriter class is a fast streaming API for XML generation. It is rather low-level; MSDN has an article on instantiating a validating XmlWriter using XmlWriter.Create().
Edit: link fixed. Here is sample code from the article:
async Task TestWriter(Stream stream)
{
XmlWriterSettings settings = new XmlWriterSettings();
settings.Async = true;
using (XmlWriter writer = XmlWriter.Create(stream, settings)) {
await writer.WriteStartElementAsync("pf", "root", "http://ns");
await writer.WriteStartElementAsync(null, "sub", null);
await writer.WriteAttributeStringAsync(null, "att", null, "val");
await writer.WriteStringAsync("text");
await writer.WriteEndElementAsync();
await writer.WriteCommentAsync("cValue");
await writer.WriteCDataAsync("cdata value");
await writer.WriteEndElementAsync();
await writer.FlushAsync();
}
}
Here's what I use:
using System;
using System.Collections.Generic;
using System.Xml;
using System.Xml.Serialization;
using System.Text;
using System.IO;
namespace Utils
{
public class XMLSerializer
{
public static Byte[] StringToUTF8ByteArray(String xmlString)
{
return new UTF8Encoding().GetBytes(xmlString);
}
public static String SerializeToXML<T>(T objectToSerialize)
{
StringBuilder sb = new StringBuilder();
XmlWriterSettings settings =
new XmlWriterSettings {Encoding = Encoding.UTF8, Indent = true};
using (XmlWriter xmlWriter = XmlWriter.Create(sb, settings))
{
if (xmlWriter != null)
{
new XmlSerializer(typeof(T)).Serialize(xmlWriter, objectToSerialize);
}
}
return sb.ToString();
}
public static void DeserializeFromXML<T>(string xmlString, out T deserializedObject) where T : class
{
XmlSerializer xs = new XmlSerializer(typeof (T));
using (MemoryStream memoryStream = new MemoryStream(StringToUTF8ByteArray(xmlString)))
{
deserializedObject = xs.Deserialize(memoryStream) as T;
}
}
}
}
Then just call:
string xml = Utils.XMLSerializer.SerializeToXML(myObjectsIEnumerable);
I haven't tried it with, for example, an IEnumerable that fetches objects one at a time remotely, or any other weird use cases, but it works perfectly for List<T> and other collections that are in memory.
EDIT: Based on your comments in response to this, you could use XmlDocument.LoadXml to load the resulting XML string into an XmlDocument, save the first one to a file, and use that as your master XML file. For each item in the IEnumerable, use LoadXml again to create a new in-memory XmlDocument, grab the nodes you want, append them to the master document, and save it again, getting rid of the new one.
After you're finished, there may be a way to wrap all of the nodes in your root tag. You could also use XSL and XslCompiledTransform to write another XML file with the objects properly wrapped in the root tag.
You can do this by implementing the IXmlSerializable interface on the large class. The implementation of the WriteXml method can write the start tag, then simply loop over the IEnumerable<MyObject> and serialize each MyObject to the same XmlWriter, one at a time.
In this implementation, there won't be any in-memory data to get rid of (past what the garbage collector will collect).
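A simpler variant of the same idea, without implementing IXmlSerializable, is to open the root element yourself on an XmlWriter and let an XmlSerializer write each item into it. This is only a sketch: the MyObject class with a single Id property and the StreamingXmlExport helper are assumptions for illustration, and the root element name MyObjects is taken from the question.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical item type standing in for the asker's MyObject.
public class MyObject
{
    public int Id { get; set; }
}

public static class StreamingXmlExport
{
    public static void Write(TextWriter output, IEnumerable<MyObject> items)
    {
        var serializer = new XmlSerializer(typeof(MyObject));

        // An empty namespace mapping keeps the xsi/xsd declarations off every item.
        var namespaces = new XmlSerializerNamespaces();
        namespaces.Add(string.Empty, string.Empty);

        using (XmlWriter writer = XmlWriter.Create(output))
        {
            writer.WriteStartElement("MyObjects");
            foreach (MyObject item in items)
            {
                // Each item is pulled from the (possibly lazy) sequence, written,
                // and can then be garbage collected; nothing is buffered as a list.
                serializer.Serialize(writer, item, namespaces);
            }
            writer.WriteEndElement();
        }
    }
}
```

To write to a file as in the question, pass a StreamWriter opened on filePath as the output; the result should be one document with a single XML declaration and a MyObjects root.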
