streaming XML serialization in .net - c#

I'm trying to serialize a very large IEnumerable<MyObject> using an XmlSerializer without keeping all the objects in memory.
The IEnumerable<MyObject> is actually lazy..
I'm looking for a streaming solution that will:
Take an object from the IEnumerable<MyObject>
Serialize it to the underlying stream using the standard serialization (I don't want to handcraft the XML here!)
Discard the in memory data and move to the next
I'm trying with this code:
using (var writer = new StreamWriter(filePath))
{
var xmlSerializer = new XmlSerializer(typeof(MyObject));
foreach (var myObject in myObjectsIEnumerable)
{
xmlSerializer.Serialize(writer, myObject);
}
}
but I'm getting multiple XML headers and I cannot specify a root tag <MyObjects> so my XML is invalid.
Any idea?
Thanks

The XmlWriter class is a fast streaming API for XML generation. It is rather low-level, MSDN has an article on instantiating a validating XmlWriter using XmlWriter.Create().
Edit: link fixed. Here is sample code from the article:
async Task TestWriter(Stream stream)
{
XmlWriterSettings settings = new XmlWriterSettings();
settings.Async = true;
using (XmlWriter writer = XmlWriter.Create(stream, settings)) {
await writer.WriteStartElementAsync("pf", "root", "http://ns");
await writer.WriteStartElementAsync(null, "sub", null);
await writer.WriteAttributeStringAsync(null, "att", null, "val");
await writer.WriteStringAsync("text");
await writer.WriteEndElementAsync();
await writer.WriteCommentAsync("cValue");
await writer.WriteCDataAsync("cdata value");
await writer.WriteEndElementAsync();
await writer.FlushAsync();
}
}

Here's what I use:
using System;
using System.Collections.Generic;
using System.Xml;
using System.Xml.Serialization;
using System.Text;
using System.IO;
namespace Utils
{
public class XMLSerializer
{
public static Byte[] StringToUTF8ByteArray(String xmlString)
{
return new UTF8Encoding().GetBytes(xmlString);
}
public static String SerializeToXML<T>(T objectToSerialize)
{
StringBuilder sb = new StringBuilder();
XmlWriterSettings settings =
new XmlWriterSettings {Encoding = Encoding.UTF8, Indent = true};
using (XmlWriter xmlWriter = XmlWriter.Create(sb, settings))
{
if (xmlWriter != null)
{
new XmlSerializer(typeof(T)).Serialize(xmlWriter, objectToSerialize);
}
}
return sb.ToString();
}
public static void DeserializeFromXML<T>(string xmlString, out T deserializedObject) where T : class
{
XmlSerializer xs = new XmlSerializer(typeof (T));
using (MemoryStream memoryStream = new MemoryStream(StringToUTF8ByteArray(xmlString)))
{
deserializedObject = xs.Deserialize(memoryStream) as T;
}
}
}
}
Then just call:
string xml = Utils.SerializeToXML(myObjectsIEnumerable);
I haven't tried it with, for example, an IEnumerable that fetches objects one at a time remotely, or any other weird use cases, but it works perfectly for List<T> and other collections that are in memory.
EDIT: Based on your comments in response to this, you could use XmlDocument.LoadXml to load the resulting XML string into an XmlDocument, save the first one to a file, and use that as your master XML file. For each item in the IEnumerable, use LoadXml again to create a new in-memory XmlDocument, grab the nodes you want, append them to the master document, and save it again, getting rid of the new one.
After you're finished, there may be a way to wrap all of the nodes in your root tag. You could also use XSL and XslCompiledTransform to write another XML file with the objects properly wrapped in the root tag.

You can do this by implementing the IXmlSerializable interface on the large class. The implementation of the WriteXml method can write the start tag, then simply loop over the IEnumerable<MyObject> and serialize each MyObject to the same XmlWriter, one at a time.
In this implementation, there won't be any in-memory data to get rid of (past what the garbage collector will collect).

Related

Well Formed XML using Service Stack

I'm building an MVC5 application which pulls records from a database and allows a user to perform some basic data cleansing edits.
Once the data has been cleansed it needs to be exported as XML, run through a validator and then uploaded to a third party portal.
I'm using Service Stack, and I've found it fairly quick and straightforward in the past, particularly when outputting to CSV.
The one issue I'm having is with the XML serialzer. I'm not sure how to make it generate well formed XML.
The file that i'm getting simply dumps it on one line, which won't validate because it isn't well formed.
below is an extract from my controller action:
Response.Clear();
Response.ContentType = "text/xml";
Response.AddHeader("Content-Disposition", "attachment; filename="myFile.xml"");
XmlSerializer.SerializeToStream(viewModel, Response.OutputStream);
Response.End();
UPDATE: Thanks for the useful comments, as explained I'm not talking about pretty printing, the issue is I need to run the file through a validator before uploading it to a third party. The error message the validator is throwing is Error:0000, XML not well-formed. Cannot have more than one tag on one line.
Firstly, be aware that most white space (including new lines) in XML is insignificant -- it has no meaning, and is only for beautification. The lack of new lines doesn't make the XML ill-formed. See White Space in XML Documents or https://www.w3.org/TR/REC-xml/#sec-white-space. Thus in theory it shouldn't matter whether ServiceStack's XmlSerializer is putting all of your XML on a single line.
That being said, if for whatever reason you must cosmetically break your XML up into multiple lines, you'll need to do a little work. From the source code we can see that XmlSerializer uses DataContractSerializer with a hardcoded static XmlWriterSettings that does not allow for setting XmlWriterSettings.Indent = true. However, since this class is just a very thin wrapper on Microsoft's data contract serializer, you can substitute your own code:
public static class DataContractSerializerHelper
{
private static readonly XmlWriterSettings xmlWriterSettings = new XmlWriterSettings { Indent = true, IndentChars = " " };
public static string SerializeToString<T>(T from)
{
try
{
using (var ms = new MemoryStream())
using (var xw = XmlWriter.Create(ms, xmlWriterSettings))
{
var serializer = new DataContractSerializer(from.GetType());
serializer.WriteObject(xw, from);
xw.Flush();
ms.Seek(0, SeekOrigin.Begin);
var reader = new StreamReader(ms);
return reader.ReadToEnd();
}
}
catch (Exception ex)
{
throw new SerializationException(string.Format("Error serializing \"{0}\"", from), ex);
}
}
public static void SerializeToWriter<T>(T value, TextWriter writer)
{
try
{
using (var xw = XmlWriter.Create(writer, xmlWriterSettings))
{
var serializer = new DataContractSerializer(value.GetType());
serializer.WriteObject(xw, value);
}
}
catch (Exception ex)
{
throw new SerializationException(string.Format("Error serializing \"{0}\"", value), ex);
}
}
public static void SerializeToStream(object obj, Stream stream)
{
if (obj == null)
return;
using (var xw = XmlWriter.Create(stream, xmlWriterSettings))
{
var serializer = new DataContractSerializer(obj.GetType());
serializer.WriteObject(xw, obj);
}
}
}
And then do:
DataContractSerializerHelper.SerializeToStream(viewModel, Response.OutputStream);

How to control the encoding while serializing and deserializing?

I'm using serialization to a string as follows.
public static string Stringify(this Process self)
{
XmlSerializer serializer = new XmlSerializer(typeof(Process));
using (StringWriter writer = new StringWriter())
{
serializer.Serialize(writer, self,);
return writer.ToString();
}
}
Then, I deserialize using this code. Please note that it's not an actual stringification from above that's used. In our business logic, it makes more sense to serialize a path, hence reading in from said path and creating an object based on the read data.
public static Process Processify(this string self)
{
XmlSerializer serializer = new XmlSerializer(typeof(Process));
using (XmlReader reader = XmlReader.Create(self))
return serializer.Deserialize(reader) as Process;
}
}
This works as supposed to except for a small issue with encoding. The string XML that's produced, contains the addition encoding="utf-16" as an attribute on the base tag (the one that's about XML version, not the actual data).
When I read in, I get an exception because of mismatching encodings. As far I could see, there's no way to specify the encoding for serialization nor deserialization in any of the objects I'm using.
How can I do that?
For now, I'm using a very brute work-around by simply cutting of the excessive junk like so. It's Q&D and I want to remove it.
public static string Stringify(this Process self)
{
XmlSerializer serializer = new XmlSerializer(typeof(Process));
using (StringWriter writer = new StringWriter())
{
serializer.Serialize(writer, self,);
return writer.ToString().Replace(" encoding=\"utf-16\"", "");
}
}

Convert Deserialization method to Async

I am trying to convert this method that deserializes an object into a string with Async/Await.
public static T DeserializeObject<T>(string xml)
{
using (StringReader reader = new StringReader(xml))
{
using (XmlReader xmlReader = XmlReader.Create(reader))
{
DataContractSerializer serializer = new DataContractSerializer(typeof(T));
T theObject = (T)serializer.ReadObject(xmlReader);
return theObject;
}
}
}
Most serialization APIs do not have async implementations, which means the only thing you can really do is wrap a sync method. For example:
public static Task<T> DeserializeObjectAsync<T>(string xml)
{
using (StringReader reader = new StringReader(xml))
{
using (XmlReader xmlReader = XmlReader.Create(reader))
{
DataContractSerializer serializer =
new DataContractSerializer(typeof(T));
T theObject = (T)serializer.ReadObject(xmlReader);
return Task.FromResult(theObject);
}
}
}
This isn't actually async - it just meets the required API. If you have the option, using ValueTask<T> is preferable in scenarios where the result may often be synchronous/
Either way, you should then be able to do something like:
var obj = await DeserializeObject<YourType>(someXml);
Debug.WriteLine(obj.Name); // etc
without needing to know whether the actual implementation was synchronous or asynchronous.
A little sample, pretty primitive way:
public delegate T Async<T>(string xml);
public void Start<T>()
{
string xml = "<Person/>";
Async<T> asyncDeserialization = DeserializeObject<T>;
asyncDeserialization.BeginInvoke(xml, Callback<T>, asyncDeserialization);
}
private void Callback<T>(IAsyncResult ar)
{
Async<T> dlg = (Async<T>)ar.AsyncState;
T item = dlg.EndInvoke(ar);
}
public T DeserializeObject<T>(string xml)
{
using (StringReader reader = new StringReader(xml))
{
using (XmlReader xmlReader = XmlReader.Create(reader))
{
DataContractSerializer serializer = new DataContractSerializer(typeof(T));
T theObject = (T)serializer.ReadObject(xmlReader);
return theObject;
}
}
}
you define a delegate and using it to Begin/End invoke using callbacks.
using the next versions of C# you can use the async keyword to get your code run asynchronously.
As you are working with a string as data source doing things async would only introduce more overhead and give you nothing for it.
But if you where reading from a stream you could copy from the source stream to a MemoryStream(buffering all data), then deserialize from the MemoryStream, that would increase the memory usage but would lower the amount of time you will block the thread.
You can return
Task.FromResult(theObject)

XMLSerializer to XElement

I have been working with XML in database LINQ and find that it is very difficult to work with the serializer.
The database LINQ required a field that store XElement.
I have a complex object with many customized structure class, so I would like to use the XmlSerializer to serialize the object.
However, the serializer can only serialize to file ("C:\xxx\xxx.xml") or a memory stream.
However to convert or serialize it to be a XElement so that I can store in the database using LINQ?
And How to do the reverse? i.e. Deserialize an XElement...
Try to use this
using (var stream = new MemoryStream())
{
serializer.Serialize(stream, value);
stream.Position = 0;
using (XmlReader reader = XmlReader.Create(stream))
{
XElement element = XElement.Load(reader);
}
}
deserialize :
XmlSerializer xs = new XmlSerializer(typeof(XElement));
using (MemoryStream ms = new MemoryStream())
{
xs.Serialize(ms, xml);
ms.Position = 0;
xs = new XmlSerializer(typeof(YourType));
object obj = xs.Deserialize(ms);
}
To make what John Saunders was describing more explicit, deserialization is very straightforward:
public static object DeserializeFromXElement(XElement element, Type t)
{
using (XmlReader reader = element.CreateReader())
{
XmlSerializer serializer = new XmlSerializer(t);
return serializer.Deserialize(reader);
}
}
Serialization is a little messier because calling CreateWriter() from an XElement or XDocument creates child elements. (In addition, the XmlWriter created from an XElement has ConformanceLevel.Fragment, which causes XmlSerialize to fail unless you use the workaround here.) As a result, I use an XDocument, since this requires a single element, and gets us around the XmlWriter issue:
public static XElement SerializeToXElement(object o)
{
var doc = new XDocument();
using (XmlWriter writer = doc.CreateWriter())
{
XmlSerializer serializer = new XmlSerializer(o.GetType());
serializer.Serialize(writer, o);
}
return doc.Root;
}
First of all, see Serialize Method to see that the serializer can handle alot more than just memory streams or files.
Second, try using XElement.CreateWriter and then passing the resulting XmlWriter to the serializer.
The SQL has XML data type may be this can help you look at msdn

XmlSerializer Converts newlines

I'm trying to serialize an object to memory, pass it to another process as a string, and deserialize it.
I've discovered that the XML Serialization process strips the \r off of the newlines for strings in the object.
byte[] b;
// serialize to memory.
using (MemoryStream ms = new MemoryStream())
{
XmlSerializer xml = new XmlSerializer(this.GetType());
xml.Serialize(ms, this);
b = ms.GetBuffer();
}
// I can now send the bytes to my process.
Process(b);
// On the other end, I use:
using (MemoryStream ms = new MemoryStream(b))
{
XmlSerializer xml = new XmlSerializer(this.GetType());
clone = (myObject)xml.Deserialize(ms);
}
How do I serialize an object without serializing it to disk just like this, but without mangling the newlines in the strings?
The strings should be wrapped in CDATA sections to preserve the newlines.
The answer came from anther SO post, but I'm reposting it here because I had to tweak it a little.
I had to create a new class to manage XML read/write to memory stream. Here it is:
public class SafeXmlSerializer : XmlSerializer
{
public SafeXmlSerializer(Type type) : base(type) { }
public new void Serialize(StreamWriter stream, object o)
{
XmlWriterSettings ws = new XmlWriterSettings();
ws.NewLineHandling = NewLineHandling.Entitize;
using (XmlWriter xmlWriter = XmlWriter.Create(stream, ws))
{
base.Serialize(xmlWriter, o);
}
}
}
Since it is built on top of XmlSerializer, it should behave exactly as expected. It's just that when I serialize with a StreamWriter, I will use the "safe" version of the serialization, thus saving myself the headache.
I hope this helps someone else.

Categories