I currently have a rather large XML file (roughly 800MB). I've made a few attempts (here is one dealing with compression) to work with it in its current condition; however, they haven't been very successful because they take quite some time.
The XML file structure is similar to below (the generation pre-dates me):
<Name>Something</Name>
<Description>Some description.</Description>
<CollectionOfObjects>
<Object>
<Name>Name Of Object</Name>
<Description>Description of object.</Description>
<AltName>Alternate name</AltName>
<ContainerName>Container</ContainerName>
<Required>true</Required>
<Length>1</Length>
<Info>
<Name>Name</Name>
<File>Filename</File>
<Size>20</Size>
<SizeUnit>MB</SizeUnit>
</Info>
</Object>
</CollectionOfObjects>
There is quite a large chunk of data under each object, and a lot of these child nodes can be made into attributes on their parents:
<CollectionOfObjects Name="Something" Description="Some description.">
<Object Name="Name Of Object" AltName="Alternate name" Container="Container" Required="true" Length="1" Description="Description of object.">
<Info Name="Name" File="Filename" Size="20" SizeUnit="MB" />
</Object>
</CollectionOfObjects>
Now, obviously not everything under each node will become an attribute; the above is just an example. There is so much data in this file that it breaks Notepad and takes Visual Studio approximately 2 minutes to even open. Heaven help you if you try to search the file, because that takes an hour or longer.
You can see how this is problematic. I've tested the size difference (obviously not with this file) using a demo file. I created a file and converted unnecessary child nodes into attributes, and it reduced the demo file's size by 53%. I have no doubt that performing the same work on this file will reduce its size by 30% or more (hoping for the more).
Now that you understand the why, let's get to the question: how do I move these child nodes to attributes? The file is generated via XmlSerializer, which uses reflection to build the nodes based on the classes and properties available:
internal class DemoClass {
[CategoryAttribute("Properties"), DescriptionAttribute("The name of this object.")]
public string Name { get; set; }
}
internal bool Serialize(DemoClass demo, FileStream fs) {
XmlSerializer serializer = new XmlSerializer(typeof(DemoClass));
XmlWriterSettings settings = null;
XmlWriter writer = null;
bool result = true;
try {
settings = new XmlWriterSettings() {
Indent = true,
IndentChars = ("\t"),
Encoding = Encoding.UTF8,
NewLineOnAttributes = false,
NewLineChars = Environment.NewLine,
NewLineHandling = NewLineHandling.Replace
};
writer = XmlWriter.Create(fs, settings);
serializer.Serialize(writer, demo);
} catch { result = false; } finally { if (writer != null) writer.Close(); }
return result;
}
It is my understanding that I can just add the XmlAttribute tag to it and it will write all future versions of the file with that tag as attributes; however, I was told that in order to convert the data from the old way to the new way I may need some kind of "binder" which I am unsure of.
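To make the idea concrete, here is a minimal sketch of what marking properties with XmlAttribute looks like. The Info class and the InfoXml helper are hypothetical stand-ins modelled on the sample XML, not the real classes, and the class is declared public because XmlSerializer only processes public types.

```csharp
using System.IO;
using System.Xml.Serialization;

// Hypothetical stand-in for one of the real classes.
public class Info
{
    [XmlAttribute] public string Name { get; set; }
    [XmlAttribute] public string File { get; set; }
    [XmlAttribute] public int Size { get; set; }
    [XmlAttribute] public string SizeUnit { get; set; }
}

public static class InfoXml
{
    // Serializes to a string so the attribute layout is easy to inspect.
    public static string Serialize(Info info)
    {
        var serializer = new XmlSerializer(typeof(Info));
        using (var writer = new StringWriter())
        {
            serializer.Serialize(writer, info);
            return writer.ToString();
        }
    }

    // Reads the attribute-based format back into an object.
    public static Info Deserialize(string xml)
    {
        var serializer = new XmlSerializer(typeof(Info));
        using (var reader = new StringReader(xml))
        {
            return (Info)serializer.Deserialize(reader);
        }
    }
}
```

Calling InfoXml.Serialize(new Info { Name = "Name", File = "Filename", Size = 20, SizeUnit = "MB" }) should write the four values as attributes on a single Info element. As far as I can tell, a property marked this way is not populated from an old-style child element when deserializing, which is why the existing files need a one-off conversion.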
Any recommendations are going to be helpful here.
NOTE: I know the following can be done to reduce file size as well (dropped by 28%):
Indent = false,
Encoding = Encoding.UTF8,
NewLineOnAttributes = false,
Update: I am currently attempting to simply use the XmlAttribute tag on properties and I've encountered an error (which I expected) where the reflection failed on deserialization:
There was an error reflecting type DemoClass.
Update 2: Now working a new angle here; I've decided to copy all of the needed classes, update them with the XmlAttribute tag; then load the old file with the old classes and write the new file with the new classes. If this works then it'll be a great workaround. However, I'm sure there's a way to do this without this workaround.
Update 3: The method in Update 2 (above) did not work the way I expected and I ended up encountering this issue. Since this approach is also heavily involved, I ended up writing a custom conversion method that used the original serialization to load the XML, then using XDocument from the System.Xml.Linq namespace, I created a new XML document by hand. This ended up being a time consuming task, but less overall change in the long run. It serializes the file in the way expected (with some tweaking here and there of course). The next step was to update the old serialization now that the old files had been converted. I've made it approximately 80% of the way through this process, still hitting some road bumps here and there with reflection:
The type for XmlAttribute may not be specified for primitive types.
This occurs when attempting to de-serialize an enum value. The serializer seems to believe it is a string value instead.
Here's the code that worked for me.
static void Main()
{
var element = XElement.Load(@"C:\Users\user\Downloads\CollectionOfObjects.xml");
ElementsToAttributes(element);
element.Save(@"C:\Users\user\Downloads\CollectionOfObjects-copy.xml");
}
static void ElementsToAttributes(XElement element)
{
foreach(var el in element.Elements().ToList())
{
if(!el.HasAttributes && !el.HasElements)
{
var attribute = new XAttribute(el.Name, el.Value);
element.Add(attribute);
el.Remove();
}
else
ElementsToAttributes(el);
}
}
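One caveat with the method above: an element cannot carry two attributes with the same name, so a parent with repeated leaf children (for example several Item elements) would throw when the second one is added. The following is a hedged variant that leaves repeated names as elements; XmlFlattener is a hypothetical name, and the uniqueness rule is my assumption about what you would want, not part of the original code.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class XmlFlattener
{
    // Moves leaf child elements onto their parent as attributes,
    // but only when the child's name occurs once under that parent.
    public static void ElementsToAttributes(XElement element)
    {
        // ToList() takes a snapshot so children can be removed while iterating.
        foreach (var child in element.Elements().ToList())
        {
            bool isLeaf = !child.HasAttributes && !child.HasElements;
            bool nameIsUnique = element.Elements(child.Name).Count() == 1
                                && element.Attribute(child.Name) == null;

            if (isLeaf && nameIsUnique)
            {
                element.SetAttributeValue(child.Name, child.Value);
                child.Remove();
            }
            else
            {
                // Repeated or non-leaf children stay as elements; recurse into them.
                ElementsToAttributes(child);
            }
        }
    }
}
```

Note that XElement.Load reads the whole document into memory, which may be a concern for a file of around 800MB.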
The Xml in CollectionOfObjects.xml
<CollectionOfObjects>
<Name>Something</Name>
<Description>Some description.</Description>
<Object>
<Name>Name Of Object</Name>
<Description>Description of object.</Description>
<AltName>Alternate name</AltName>
<ContainerName>Container</ContainerName>
<Required>true</Required>
<Length>1</Length>
<Info>
<Name>Name</Name>
<File>Filename</File>
<Size>20</Size>
<SizeUnit>MB</SizeUnit>
</Info>
</Object>
</CollectionOfObjects>
The result Xml in CollectionOfObjects-copy.xml
<?xml version="1.0" encoding="utf-8"?>
<CollectionOfObjects Name="Something" Description="Some description.">
<Object Name="Name Of Object" Description="Description of object." AltName="Alternate name" ContainerName="Container" Required="true" Length="1">
<Info Name="Name" File="Filename" Size="20" SizeUnit="MB" />
</Object>
</CollectionOfObjects>
I have a xsd file, defined by an external company, that I used with xsd.exe to generate classes. I can use a provided xml file to deserialize into an object using the generated classes just fine, but there are a few cases where I need to have smaller portions of the xml as a XDocument. I won't know the path in these portions until run time, so I'm using the xml for:
XElement element = xml.XPathSelectElement(path);
The issue I'm having is that the serialized result doesn't quite match the incoming xml, which makes the select return null. How do I get a serialized object to look like the incoming file? Did I possibly generate the classes incorrectly with xsd.exe? I'll eventually need to use the same generated code to generate my own xml files.
Here's the code I'm currently using to serialize
var xml = new XDocument();
using (var writer = xml.CreateWriter())
{
List<Type> known = new List<Type>();
known.Add(typeof(ObjType1));
...
var serializer = new DataContractSerializer(typeof(Detail), known);
serializer.WriteObject(writer, sourceDetailObj);
}
The serialized result:
<Detail xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schemas.datacontract.org/2004/07/CustomNameSpace">
...
<numberField>1</numberField>
<detailTypeField>
<objField i:type="ObjType1">
<valObjField i:nil="true" />
...
</objField>
</detailTypeField>
...
</Detail>
What it should look like:
<Detail>
...
<Number>1</Number>
<DetailType>
<ObjType1>
...
</ObjType1>
</DetailType>
...
</Detail>
Here's one of the classes xsd generates:
public partial class DetailType {
private object objField;
[System.Xml.Serialization.XmlElementAttribute("ObjType1", typeof(ObjType1))]
...
public object Obj {
get {
return this.objField;
}
set {
this.objField = value;
}
}
}
Obj can be one of several classes.
The problem with using a DataContractSerializer is that it is optimized for sending messages between WCF services and won't necessarily produce the same "classic" xml that the XmlSerializer does.
In particular, XmlSerializer will serialize all public members unless you tell it not to, whereas DataContractSerializer won't serialize a member unless you tell it to. This was done to help make WCF faster; you only get what you ask for.
So, if you're not generating XML for WCF services, I suggest that you use the XmlSerializer instead.
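As a rough sketch of that suggestion, the serialization code from the question could be switched to XmlSerializer like this. The Detail class here is a cut-down hypothetical stand-in for the xsd.exe-generated class, and DetailXml is a helper name I made up.

```csharp
using System.Xml.Linq;
using System.Xml.Serialization;
using System.Xml.XPath;

// Hypothetical, simplified stand-in for the generated class.
public class Detail
{
    public int Number { get; set; }
}

public static class DetailXml
{
    // Serializes any XmlSerializer-compatible object straight into an XDocument.
    public static XDocument ToXDocument<T>(T value)
    {
        var document = new XDocument();
        var serializer = new XmlSerializer(typeof(T));
        using (var writer = document.CreateWriter())
        {
            serializer.Serialize(writer, value);
        }
        return document;
    }

    // Convenience wrapper around the XPath lookup used in the question.
    public static XElement Select(XDocument document, string path)
    {
        return document.XPathSelectElement(path);
    }
}
```

With element names driven by the generated XmlElement attributes rather than by field names, a path such as /Detail/Number should then resolve instead of returning null.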
I started with three (3) XSD files provided from an external party (one XSD links to the other two). I used the xsd.exe tool to generate a .NET object by running the following command: xsd.exe mof-simpleTypes.xsd mof-isp.xsd esf-submission.xsd /c and it generated a single CS file with a handful of partial objects.
I've created an XmlSerializerNamespaces object and filled it with the required namespaces (two directly used in the provided sample XML file as well as two others that don't appear to be referenced). I have successfully generated an XML file using the following method:
private XmlDocument ConvertEsfToXml(ESFSubmissionType type)
{
var xml = new XmlDocument();
var serializer = new XmlSerializer(type.GetType());
string result;
using (var writer = new Utf8StringWriter()) //override of StringWriter to force UTF-8
{
serializer.Serialize(writer, type, _namespaces); //_namespaces object holds all 4 namespaces
result = writer.ToString();
}
xml.LoadXml(result);
return xml;
}
My problem is that in the generated CS file, one of the objects has a property (another generated partial object) that is of type XmlElement. I have successfully built the object in code, but I'm having an issue converting the object to an XmlElement. The questions and answers I have found here on SO say to convert it to an XmlDocument first and then take the DocumentElement property. This works; however, the returned XML has namespaces embedded in the element as follows:
<esf:ESFSubmission xmlns:isp="http://www.for.gov.bc.ca/schema/isp" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:esf="http://www.for.gov.bc.ca/schema/esf">
<esf:submissionMetadata>
<esf:emailAddress>test@test.com</esf:emailAddress>
<esf:telephoneNumber>1234567890</esf:telephoneNumber>
</esf:submissionMetadata>
<esf:submissionContent>
<isp:ISPSubmission xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:esf="http://www.for.gov.bc.ca/schema/esf" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:isp="http://www.for.gov.bc.ca/schema/isp">
<isp:ISPMillReport>
<isp:reportMonth>12</isp:reportMonth>
<isp:reportYear>2014</isp:reportYear>
<isp:reportComment>comment</isp:reportComment>
<isp:ISPLumberDetail>
<isp:species>FI</isp:species>
Note: this is just a partial of the generated XML file (for illustration purposes).
As you can see, each XML node is prefixed with the namespace variable. My question is: how can I do this in code? Is my approach sound, and if so, how do I NOT include the namespaces in the ISPSubmission node? Or, if there is a better way to approach this problem that I overlooked, please provide insight. My desired outcome is to have all namespace definitions at the top of the document (their appropriate location) and not on the sub elements, while keeping the namespace prefixes on each element as illustrated above.
EDIT (after reggaeguitar's comment)
Here is the sample XML document I was provided
<?xml version="1.0" encoding="UTF-8"?>
<esf:ESFSubmission xmlns:esf="http://www.for.gov.bc.ca/schema/esf"
xmlns:isp="http://www.for.gov.bc.ca/schema/isp" xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.for.gov.bc.ca/schema/esf esf-submission.xsd
http://www.for.gov.bc.ca/schema/isp mof-isp.xsd">
<esf:submissionMetadata>
<esf:emailAddress>mailto:eric.murphy@cgi.com</esf:emailAddress>
<esf:telephoneNumber>6044445555</esf:telephoneNumber>
</esf:submissionMetadata>
<esf:submissionContent>
<isp:ISPSubmission>
<isp:ISPMillReport>
<isp:reportMonth>06</isp:reportMonth>
<isp:reportYear>2014</isp:reportYear>
<isp:reportComment>Up to 4000 characters is permitted for notes in this element.</isp:reportComment>
<isp:ISPLumberDetail>
<isp:species>FI</isp:species>
<isp:lumberGrade>EC</isp:lumberGrade>
<isp:gradeDescription/>
<isp:size>2x4</isp:size>
<isp:finishType/>
<isp:length>10</isp:length>
<isp:thickWidthUom>IN</isp:thickWidthUom>
<isp:volumeUnitOfMeasure>MBM</isp:volumeUnitOfMeasure>
<isp:volume>11543.987</isp:volume>
<isp:amount>1467893.98</isp:amount>
<isp:invoiceNumber>837261</isp:invoiceNumber>
</isp:ISPLumberDetail>
<isp:ISPLumberDetail>
<isp:species>CE</isp:species>
<isp:lumberGrade/>
<isp:gradeDescription/>
<isp:size/>
<isp:finishType>D</isp:finishType>
<isp:thickness>40</isp:thickness>
<isp:width>100</isp:width>
<isp:thickWidthUom>MM</isp:thickWidthUom>
<isp:volumeUnitOfMeasure>MBM</isp:volumeUnitOfMeasure>
<isp:volume>9743.987</isp:volume>
<isp:amount>1247893.98</isp:amount>
<isp:invoiceNumber/>
</isp:ISPLumberDetail>
<isp:ISPChipDetail>
<isp:species>CE</isp:species>
<isp:unitOfMeasure>BDT</isp:unitOfMeasure>
<isp:wholeLogInd>N</isp:wholeLogInd>
<isp:destinationCode>FBCO</isp:destinationCode>
<isp:destinationDescription/>
<isp:volume>563</isp:volume>
<isp:amount>54463</isp:amount>
<isp:invoiceNumber>12345679</isp:invoiceNumber>
</isp:ISPChipDetail>
</isp:ISPMillReport>
<isp:ISPSubmitter>
<isp:millNumber>103</isp:millNumber>
<isp:contactName>Dave Marotto</isp:contactName>
<isp:contactEmail>eric.murphy@cgi.com</isp:contactEmail>
<isp:contactPhone>2507775555</isp:contactPhone>
<isp:contactPhoneExtension>1234</isp:contactPhoneExtension>
</isp:ISPSubmitter>
</isp:ISPSubmission>
</esf:submissionContent>
</esf:ESFSubmission>
Solved my problem by doing the whole thing in code and not even using the xsd.exe to generate a .NET object.
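The exact code isn't shown here, so the following is only one possible way to build such a document by hand with LINQ to XML; it is a sketch, not the code actually used. The namespace URIs and element names come from the sample above, while EsfBuilder and the hard-coded values are made up for illustration. Declaring the prefixes once on the root keeps them off the child elements.

```csharp
using System.Xml.Linq;

public static class EsfBuilder
{
    static readonly XNamespace Esf = "http://www.for.gov.bc.ca/schema/esf";
    static readonly XNamespace Isp = "http://www.for.gov.bc.ca/schema/isp";

    public static XElement Build()
    {
        return new XElement(Esf + "ESFSubmission",
            // Prefix declarations live on the root element only.
            new XAttribute(XNamespace.Xmlns + "esf", Esf.NamespaceName),
            new XAttribute(XNamespace.Xmlns + "isp", Isp.NamespaceName),
            new XElement(Esf + "submissionContent",
                new XElement(Isp + "ISPSubmission",
                    new XElement(Isp + "ISPMillReport",
                        new XElement(Isp + "reportMonth", "12"),
                        new XElement(Isp + "reportYear", "2014")))));
    }
}
```

Because the isp prefix is already in scope from the root, the ISPSubmission element is written with its prefix but without repeating any xmlns declarations.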
I have an XSD schema already for the following xml file
<?xml version="1.0"?>
<note>
<to> </to>
<from> </from>
<datetime> </datetime>
<heading> </heading>
<body> </body>
</note>
I implemented a NoteGenerator to generate xml files based on the schema. The xml files must be generated according to certain templates/specifications, such as:
<?xml version="1.0"?>
<note>
<to> Lucy </to>
<from> Lily </from>
<datetime> --date--time-- </datetime>
<heading> reminder </heading>
<body> do not forget my pen </body>
</note>
Another template/specification would be like:
<?xml version="1.0"?>
<note>
<to> Lily </to>
<from> Lucy </from>
<datetime> --date--time-- </datetime>
<heading> reply </heading>
<body> no problem </body>
</note>
Here <datetime> is a dynamic value set when the xml is generated (so this value cannot be predetermined). Based on the XSD schema and these two XML specifications, I can easily generate XML messages.
How can I unit test the generated XML files?
Do I need to validate the generated XML files against the schema? Or do I need to use some diff tool to compare the generated xml files with the template? Because the datetime is dynamic, it is different each time an xml file is generated, so how do I compare them with the template? Or do I need to deserialise the xml to a C# object and then test the C# object?
This might be helpful for you. Here I am creating an object, assigning values, writing it to XML, reading the XML back, and comparing the result to the original object. I am assuming that you have the whole class structure.
// This is the expected object that you are going to write to XML.
var sourceObject = new SomeClassToWriteInXML();
// Write the object to XML.
var document = new XDocument();
var serializer = new XmlSerializer(typeof(SomeClassToWriteInXML));
using (var writer = document.CreateWriter())
{
    serializer.Serialize(writer, sourceObject);
}
// Write document to a file here if needed.
// document now holds the XML.
// Normally you would read back the file you just created; for testing's sake I read document directly.
SomeClassToWriteInXML actual;
// Deserialize the XML to get the actual object (which should be equal to sourceObject).
using (var reader = document.CreateReader())
{
    actual = (SomeClassToWriteInXML)serializer.Deserialize(reader);
}
Assert.AreEqual(sourceObject.First(), actual.First());
You can easily compare generated XML node values, except from the datetime. This is because of its non-deterministic nature. In unit testing (and code design) such problems are usually solved in either of two ways:
removing non-determinism altogether
loosening your requirements relating to non-determinism (eg. by not performing exact matching but rather some sort of fuzzy/approximated one)
With first solution, your note generating component would need to abstract out current date time to external service/dependency, say:
public class NoteGenerator
{
private readonly ICurrentDateProvider currentDateProvider;
public NoteGenerator(ICurrentDateProvider currentDateProvider)
{
this.currentDateProvider = currentDateProvider;
}
public string GenerateNote()
{
var currentDate = currentDateProvider.Now;
// ...
Now in unit test you can fake that dependency using your isolation framework of choice and perform assertions against deterministic value you set yourself (example with FakeItEasy):
var dateProvider = A.Fake<ICurrentDateProvider>();
A.CallTo(() => dateProvider.Now).Returns(new DateTime(2014, 01, 31, 10, 30));
var generator = new NoteGenerator(dateProvider);
// ...
The second approach is to replace "date time must be this value" matching with "date time must not be older than" matching, for example:
var oneMinuteAgo = DateTime.Now.AddMinutes(-1.0);
var generator = new NoteGenerator();
var dateFromXml = // extract
Assert.That(dateFromXml, Is.GreaterThan(oneMinuteAgo));
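On the question of validating against the schema: that check is independent of the datetime value, so it can sit alongside either approach. A minimal sketch follows, assuming the XSD and the generated XML are available as strings; NoteValidation is a hypothetical helper name.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;
using System.Xml.Schema;

public static class NoteValidation
{
    // Returns the validation error messages; an empty list means the XML is valid.
    public static IList<string> Validate(string xml, string xsd)
    {
        var schemas = new XmlSchemaSet();
        using (var schemaReader = XmlReader.Create(new StringReader(xsd)))
        {
            // Passing null uses the targetNamespace declared in the schema itself.
            schemas.Add(null, schemaReader);
        }

        var errors = new List<string>();
        XDocument.Parse(xml).Validate(schemas, (sender, e) => errors.Add(e.Message));
        return errors;
    }
}
```

In a test you would then assert that the returned list is empty for each generated note, and use the date provider or the "not older than" check for the datetime itself.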
Sometimes, when validating certain XML documents using an XmlValidatingReader, I receive the following error:
System.Xml.Schema.XmlSchemaValidationException:
"The 'http://www.w3.org/XML/1998/namespace:lang' attribute is not declared."
The same document sometimes succeeds. I cannot figure out why.
My XSD imports the schema like so:
<xs:schema id="myschemaId"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://mytargetnamespace.com"
xmlns="http://mytargetnamespace.com"
xmlns:mm="http://mytargetnamespace.com"
elementFormDefault="qualified">
<xs:import namespace="http://www.w3.org/XML/1998/namespace"
schemaLocation="http://www.w3.org/2001/xml.xsd" />
...
And in the XML document I have the following attributes:
<root xmlns="http://mytargetnamespace.com"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://mytargetnamespace.com myschema.xsd">
Finally, the XmlReaderSettings:
const XmlSchemaValidationFlags validationFlags =
XmlSchemaValidationFlags.ProcessInlineSchema |
XmlSchemaValidationFlags.ProcessSchemaLocation |
XmlSchemaValidationFlags.ReportValidationWarnings |
XmlSchemaValidationFlags.AllowXmlAttributes;
// Set the validation settings.
var settings = new XmlReaderSettings
{
ValidationType = ValidationType.Schema,
ValidationFlags = validationFlags,
DtdProcessing = DtdProcessing.Parse
};
settings.ValidationEventHandler += OnValidationEventHandler;
// Create the XmlReader object.
var reader = XmlReader.Create(_xmlFilePath, settings);
// Parse the file.
while (reader.Read()) {}
This is a standalone exe running .NET 4.0 on Windows 2003.
I've noticed that there's a significant pause when it's trying to validate. Could that be related? Is it trying to download the actual "xml.xsd" schema and not succeeding?
Because many of the DTDs and XSDs originated from the W3C, they have the problem that many people try to resolve them from their servers, resulting in their being inundated with requests - millions and millions of them. So they started blocking "excessive" requests.
See this blog entry, which also applies to XSDs.
The solution is to use a local copy.
I'm pretty confident I've solved this one. I checked Fiddler and did see requests going out to w3c.org for the xsd file. A little more research turned up this link; remark #3 seemed to relate to my situation. So if for whatever reason my machine couldn't download the XSD file, then the xml namespace became unavailable. Sadly the real error ("could not reach w3c.org" or what have you) was never reported.
Removing the schemaLocation from the xs:import did the trick.
I have a rather detailed xml file. Below are the top level nodes (I have used ellipses because the lower level nodes are all well formed and properly filled with data):
<?xml version="1.0" encoding="UTF-8"?>
<config>
<Models>...</Models>
<Data>...</Data>
</config>
I have created an xsd file from using the Visual Studio 2008 command prompt:
xsd sample.xml
This generates the xsd file just fine. I then auto generate classes from the xsd with the command:
xsd sample.xsd /classes
For the deserialization of the xml file into a class object, I'm using the read function in the helper class:
public class XmlSerializerHelper<T>
{
public Type _type;
public XmlSerializerHelper()
{
_type = typeof(T);
}
public void Save(string path, object obj)
{
using (TextWriter textWriter = new StreamWriter(path))
{
XmlSerializer serializer = new XmlSerializer(_type);
serializer.Serialize(textWriter, obj);
}
}
public T Read(string path)
{
T result;
using (TextReader textReader = new StreamReader(path))
{
XmlSerializer deserializer = new XmlSerializer(_type);
result = (T)deserializer.Deserialize(textReader);
}
return result;
}
}
When attempting the deserialization with:
var helper = new XmlSerializerHelper<configModels>();
var obj = new configModels();
obj = helper.Read(filepath);
I receive an error which, as far as I can tell, occurs because the deserializer is looking for the 'Models' node but the corresponding class name was generated as a combination of the root node and the 'Models' node (configModels). Why are the class names generated like this?
I tried to deserialize from the top node using:
var helper = new XmlSerializerHelper<config>();
var obj = new config();
obj = helper.Read(filepath);
Unfortunately, this results in a slew of errors like the following:
System.InvalidOperationException was unhandled by user code
Message="Unable to generate a temporary class (result=1).
error CS0030: Cannot convert type 'Application.Lease[]' to 'Application.Lease'
error CS0030: Cannot convert type 'Application.CashFlow[]' to 'Application.CashFlow'
...etc.
Can somebody steer me towards what I might be doing wrong with my xsd auto-generating?
XSD.EXE is a good start - but it's far from perfect. Also, based on the XML you provided, XSD.EXE can't always decide for sure whether something is a single instance of an object, or an open-ended array of objects.
This seems to be the case for your two elements - Application.Lease and Application.CashFlow. How are they defined in the generated XSD file? Does that make sense to you? Quite possibly, you'd have to add a few hints, such as:
<xs:element name="Lease" minOccurs="0" maxOccurs="1" />
for an optional property that occurs zero or one times only. Things like that are really hard for the xsd.exe tool to figure out based on just a single XML sample file.
Marc
Go to your generated class and change all from [][] ---> []
There's an issue with xsd.exe and lists. You have to go into the generated class and manually edit the file to the correct type. I've switched to using Xsd2Code. So far it doesn't seem to have this problem.
Another issue that can cause this problem is that the xml file contents between the tags (meaning the content) is still encoded when it shouldn't be. For example, the <br> tags in my content were still <br> instead of <br />. The xsd generator turned these into elements in the schema then mislabeled them as unbounded since there was more than one found. Unencoding them fixed the problem and generated the classes correctly.