I have a big (~40 MB) collection of XML data, split across many files that are not well formed, so I merge them, add a root node and load all the XML into an XmlDocument. It's basically a list of 3 different types which can be nested in a few different ways. This example should show most of the cases:
<Root>
<A>
<A>
<A></A>
<A></A>
</A>
</A>
<A />
<B>
<A>
<A>
<A></A>
<A></A>
</A>
</A>
</B>
<C />
</Root>
I'm separating all A, B and C nodes by using XPath expressions on an XmlDocument (//A, //B, //C), converting the resulting node sets to a DataTable and showing a list of all nodes of each node type separately in a DataGridView. This works fine.
But now I'm facing an even bigger file, and as soon as I load it, it shows me only 4 rows. So I added a breakpoint at the line where the actual XmlDocument.SelectNodes call happens and checked the resulting node set. It showed about 25,000 entries. After I continued, the program finished loading and, whoops, all my 25k rows were shown. I tried it again and I can reproduce it. If I step over XmlDocument.SelectNodes by hand, it works. If I don't break there, it does not. I'm not spawning a single thread in my application.
How can I debug this any further? What should I look for? I have experienced such behaviour with multithreaded libraries such as JSch (SSH), but I don't see why this should happen in my case.
Thank you very much!
// class XmlToDataTable:
private DataTable CreateTable(NamedXPath logType,
List<XmlColumn> columns,
ITableCreator tableCreator)
{
// I have to break here -->
XmlNodeList xmlNodeList = logFile.GetEntries(logType);
// <-- I have to break here
DataTable dataTable = tableCreator.CreateTableLayout(columns);
foreach (XmlNode xmlNode in xmlNodeList)
{
DataRow row = dataTable.NewRow();
tableCreator.PopulateRow(xmlNode, row, columns);
dataTable.Rows.Add(row);
}
return dataTable;
}
// class Logfile:
public XmlNodeList GetEntries(NamedXPath e)
{
return (_xmlDocument != null && _xmlDocument.HasChildNodes)
? _xmlDocument.SelectNodes(e.XPath)
: new XmlNullObjectNodeList();
}
// _xmlDocument gets loaded here after reading all xml fragments into a string
// (ugly, i know. the // ugly! comment reminds me about that ;))
private void CreateXmlDoc()
{
_xmlDocument = new XmlDocument();
_xmlDocument.LoadXml(OPEN_ROOT_ELEMENT + _xmlString +
CLOSE_ROOT_ELEMENT);
if (DataChanged != null)
DataChanged(this, new EventArgs());
}
// class NamedXPath:
public abstract class NamedXPath
{
private readonly String _name;
private readonly String _xPath;
protected NamedXPath(string name, string xPath)
{
_name = name;
_xPath = xPath;
}
public string Name
{
get { return _name; }
}
public string XPath
{
get { return _xPath; }
}
}
Instead of using XPath directly in the code first, I would use a tool such as SketchPath to get my XPath right. You can either load your original XML or use a subset of it.
Play with the XPath and your XML to see whether the expected nodes are selected before using the XPath in your code.
Okay, solved it. tableCreator is part of my strategy pattern, which influences the way the table is built. In a certain implementation I do something like this:
XmlNode xn = xmlDocument.SelectSingleNode(fancyXPath);
// if a node has a child <a>, then it's a linked list:
// <a><a><a></a></a></a>
if(xn.SelectSingleNode("a") != null)
xn.SelectSingleNode("a").InnerText = "<IDs of linked list items CSV like here>";
Which means I'm replacing parts of an XML linked list with some text and losing the nested items there.
This bug wouldn't be hard to find if the change didn't affect the original XmlDocument. Even then, debugging it should not be too hard. What makes my program behave differently depending on whether I break or not seems to be the following:
Return Value:
The first XmlNode that matches the XPath query or null if no matching node is found. The XmlNode should not be expected to be connected "live" to the XML document. That is, changes that appear in the XML document may not appear in the XmlNode, and vice versa. (API description of XmlNode.SelectSingleNode())
If I break there, the changes are written back to the original XmlDocument; if I don't break, they are not written back. I can't really explain that to myself, but without the change in the XmlNode everything works.
edit:
Now I'm quite sure: I had XmlNodeList.Count in my watches. This means that every time I debugged, VS called the Count property, which not only returns a number but also calls ReadUntil(int), which fills the internal list:
internal int ReadUntil(int index)
{
int count = this.list.Count;
while (!this.done && (count <= index))
{
if (this.nodeIterator.MoveNext())
{
XmlNode item = this.GetNode(this.nodeIterator.Current);
if (item != null)
{
this.list.Add(item);
count++;
}
}
else
{
this.done = true;
return count;
}
}
return count;
}
This may have caused that weird behavior.
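One way to sidestep this kind of surprise is to copy the node list into a plain List<XmlNode> before anything modifies the document. The following is only a sketch: it assumes the list returned by SelectNodes is filled lazily, as the decompiled ReadUntil above suggests, and the class name SnapshotDemo and the sample XML are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml;

public static class SnapshotDemo
{
    // Copies every matching node into a list up front, so later edits to the
    // document cannot change which nodes the caller iterates over.
    public static List<XmlNode> Snapshot(XmlDocument doc, string xpath)
    {
        return doc.SelectNodes(xpath).Cast<XmlNode>().ToList();
    }

    public static void Main()
    {
        var doc = new XmlDocument();
        doc.LoadXml("<Root><A><A/></A><A/></Root>");

        List<XmlNode> snapshot = Snapshot(doc, "//A");
        Console.WriteLine(snapshot.Count);

        // Setting InnerText replaces all children of the first A,
        // which removes the nested A from the document.
        snapshot[0].InnerText = "1,2";

        // The snapshot keeps its entries; a fresh query sees the edited document.
        Console.WriteLine(snapshot.Count);
        Console.WriteLine(doc.SelectNodes("//A").Count);
    }
}
```

If the table builder has to rewrite nodes, running it against a copy such as doc.CloneNode(true) would be another option that leaves the original document untouched.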
I am trying to pull out data from an XML document that seems to use relative references like this:
<action>
<topic reference="../../action[110]/topic"/>
<context reference="../../../../../../../../../../../../../contexts/items/context[2]"/>
</action>
Two questions:
Is this normal or common?
Is there a way to handle this with linq to XML / XDocument or would I need to manually traverse the document tree?
Edit:
To clarify, the references are to other nodes within the same XML document. The context node above references a list of contexts, and says to get the one at index 2.
The topic node worries me more because it's referencing a certain other action's topic, which could in turn reference a list of topics. If that wasn't happening I would have just loaded the lists of contexts and topics in a cache and looked them up that way.
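You can use an XPath query to extract the nodes, and it is very efficient.
Step 1: Load the XML into an XmlDocument.
Step 2: Use node.SelectNodes("//*[@reference]"). The @ is needed because reference is an attribute; //*[reference] would match elements that have a child element named reference.
Step 3: Loop through the resulting XML nodes.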
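The three steps could look roughly like this. It is a sketch only: the sample XML is shortened from the question, and the class and method names are hypothetical. Note that an attribute test in XPath is written with a leading @.

```csharp
using System;
using System.Xml;

public static class ReferenceQueryDemo
{
    // Returns every element in the document that carries a "reference" attribute.
    public static XmlNodeList FindReferencingNodes(XmlDocument doc)
    {
        // '@' tests for an attribute; without it the predicate looks for a child element.
        return doc.SelectNodes("//*[@reference]");
    }

    public static void Main()
    {
        // Step 1: load the XML.
        var doc = new XmlDocument();
        doc.LoadXml(
            "<actions><action>" +
            "<topic reference='../../action[110]/topic'/>" +
            "<context reference='../../contexts/items/context[2]'/>" +
            "</action></actions>");

        // Steps 2 and 3: query, then loop over the matches.
        foreach (XmlNode node in FindReferencingNodes(doc))
        {
            Console.WriteLine(node.Name + " -> " + node.Attributes["reference"].Value);
        }
    }
}
```

This only finds the nodes that hold a reference; resolving each reference to its target is a separate step.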
I ended up manually traversing the tree. But with extension methods it's all nice and out of the way. In case it might help anyone in the future, this is what I threw together for my use-case:
public static XElement GetRelativeNode(this XAttribute attribute)
{
return attribute.Parent.GetRelativeNode(attribute.Value);
}
public static XElement GetRelativeNode(this XElement node, string pathReference)
{
if (!pathReference.Contains("..")) return node; // Not relative reference
var parts = pathReference.Split(new string[] { "/"}, StringSplitOptions.RemoveEmptyEntries);
XElement current = node;
foreach (var part in parts)
{
if (string.IsNullOrEmpty(part)) continue;
if (part == "..")
{
current = current.Parent;
}
else
{
if (part.Contains("["))
{
var opening = part.IndexOf("[");
var targetNodeName = part.Substring(0, opening);
var ending = part.IndexOf("]");
var nodeIndex = int.Parse(part.Substring(opening + 1, ending - opening - 1));
current = current.Descendants(targetNodeName).Skip(nodeIndex-1).First();
}
else
{
current = current.Element(part);
}
}
}
return current;
}
And then you'd use it like this (item is an XElement):
item.Element("topic").Attribute("reference").GetRelativeNode().Value
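As an alternative to hand-parsing the path, LINQ to XML can evaluate XPath relative to an element through the XPathSelectElement extension method in System.Xml.XPath. The sketch below assumes the reference values are valid XPath expressions relative to the element carrying the attribute, which is how they look in the question; the class name ReferenceResolver, the method Resolve and the sample document are hypothetical.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class ReferenceResolver
{
    // Evaluates the attribute's value as an XPath expression relative to the
    // element that carries the attribute. Returns null when nothing matches.
    public static XElement Resolve(XAttribute reference)
    {
        return reference.Parent.XPathSelectElement(reference.Value);
    }

    public static void Main()
    {
        XDocument doc = XDocument.Parse(
            "<root>" +
              "<contexts><items><context>home</context><context>work</context></items></contexts>" +
              "<actions><action>" +
                "<context reference='../../../contexts/items/context[2]'/>" +
              "</action></actions>" +
            "</root>");

        XAttribute reference = doc.Descendants("action").First()
                                  .Element("context").Attribute("reference");

        XElement target = Resolve(reference);
        Console.WriteLine(target.Value);
    }
}
```

Because the target may itself carry a reference attribute (the topic case in the question), a caller could keep resolving until the returned element no longer has one.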
What I am trying to do is create, ideally, a nested List (basically a 2D list, or a 2D array if that is better for this task) that would work as follows: ID => 1, Name => Hickory, without explicitly selecting the node.
I could use SelectNode (Woods/Wood) and then do something like node["ID"].InnerText, but that would require that I know what the node's name is.
Assume that this would read wood.xml even if there were 36 nodes instead of 7 and that I will never know the names of the nodes. I tried using OuterXml/InnerXml but that gives me too much information.
XmlDocument doc = new XmlDocument();
doc.Load("wood.xml");
//Here is wood.xml
/*<Woods><Wood><ID>1</ID><Name>Hickory</Name><Weight>3</Weight><Thickness>4</Thickness><Density>5</Density><Purity>6</Purity><Age>7</Age></Wood><Wood><ID>2</ID><Name>Soft Maple</Name><Weight>3</Weight><Thickness>4</Thickness><Density>5</Density><Purity>6</Purity><Age>7</Age></Wood><Wood><ID>3</ID><Name>Red Oak</Name><Weight>3</Weight><Thickness>4</Thickness><Density>5</Density><Purity>6</Purity><Age>7</Age></Wood></Woods>*/
XmlNode root = doc.FirstChild;
//Display the contents of the child nodes.
if (root.HasChildNodes)
{
for (int i=0; i<root.ChildNodes.Count; i++)
{
Console.WriteLine(root.ChildNodes[i].InnerXml);
Console.WriteLine();
}
Console.ReadKey();
}
That would allow me to basically create a wood "buffer", if you will, so I can access these values elsewhere.
Sorry if I was unclear; I want to essentially make this "abstract", for lack of a better word.
So that if I were someday to change the name of "Weight" to "HowHeavy", or if I were to add an additional element "NumberOfBranches", I would not have to hardcode the structure of the XML file.
Is this what you are after?
class Program
{
static void Main(string[] args)
{
string xml = @"<Woods><Wood><ID>1</ID><Name>Hickory</Name><Weight>3</Weight><Thickness>4</Thickness><Density>5</Density><Purity>6</Purity><Age>7</Age></Wood><Wood><ID>2</ID><Name>Soft Maple</Name><Weight>3</Weight><Thickness>4</Thickness><Density>5</Density><Purity>6</Purity><Age>7</Age></Wood><Wood><ID>3</ID><Name>Red Oak</Name><Weight>3</Weight><Thickness>4</Thickness><Density>5</Density><Purity>6</Purity><Age>7</Age></Wood></Woods>";
XDocument doc = XDocument.Parse(xml);
//Get your wood nodes and values in a list
List<Tuple<string,string>> list = doc.Descendants().Select(a=> new Tuple<string,string>(a.Name.LocalName,a.Value)).ToList();
// display the list
list.ForEach(a => Console.WriteLine("Node name {0} , Node Value {1}", a.Item1, a.Item2));
Console.Read();
}
}
You can use xmlDocument.SelectNodes("//child::node()")
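If the goal is a "buffer" that can be queried by field name later, one dictionary per Wood element may be easier to work with than a flat list of name/value pairs. This is a sketch under the assumption that each item has at most one child per element name (ToDictionary throws on duplicate keys); the class name WoodBuffer and the method Load are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class WoodBuffer
{
    // Builds one dictionary per child of the root, keyed by whatever
    // element names happen to be present. No names are hardcoded.
    public static List<Dictionary<string, string>> Load(XDocument doc)
    {
        return doc.Root.Elements()
            .Select(item => item.Elements()
                .ToDictionary(field => field.Name.LocalName, field => field.Value))
            .ToList();
    }

    public static void Main()
    {
        XDocument doc = XDocument.Parse(
            "<Woods>" +
            "<Wood><ID>1</ID><Name>Hickory</Name><Weight>3</Weight></Wood>" +
            "<Wood><ID>2</ID><Name>Soft Maple</Name><Weight>3</Weight></Wood>" +
            "</Woods>");

        foreach (Dictionary<string, string> wood in Load(doc))
        {
            foreach (KeyValuePair<string, string> field in wood)
            {
                Console.WriteLine(field.Key + " => " + field.Value);
            }
            Console.WriteLine();
        }
    }
}
```

Renaming "Weight" to "HowHeavy" or adding "NumberOfBranches" in the file would then simply show up as a different key, with no code change.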
When I load this XML node, the HTML within the node is being completely stripped out.
This is the code I use to get the value within the node, which is text combined with HTML:
var stuff = innerXml.Descendants("root").Elements("details").FirstOrDefault().Value;
Inside the "details" node is text that looks like this:
"This is <strong>test copy</strong>. This is A Link"
When I look in "stuff" var I see this:
"This is test copy. This is A Link". There is no HTML in the output... it is pulled out.
Maybe Value should be innerXml or innerHtml? Does FirstOrDefault() have anything to do with this?
I don't think the xml needs a "cdata" block...
Here is a more complete code snippet:
announcements =
from link in xdoc.Descendants(textContainer).Elements(textElement)
where link.Parent.Attribute("id").Value == Announcement.NodeId
select new AnnouncmentXml
{
NodeId = link.Attribute("id").Value,
InnerXml = link.Value
};
XDocument innerXml;
innerXml = XDocument.Parse(item.InnerXml);
var abstractText = innerXml.Descendants("root").Elements("abstract").FirstOrDefault().Value; // "abstract" is a reserved word in C#
Finally, here is a snippet of the XML node. Notice how there is "InnerXml" within the standard XML structure. It starts with <root>. I call this the "InnerXml", and this is what I am passing into the XDocument called innerXml:
<text id="T_403080"><root> <title>How do I do stuff?</title> <details> Look Here Some Form. Please note that lorem ipsum dlor sit amet.</details> </root></text>
[UPDATE]
I tried to use this helper lambda, and it returns the HTML, but it is escaped, so when it displays on the page I see the actual HTML in the view (instead of rendering a link, the tag is printed to the screen):
Title = innerXml.Descendants("root").Elements("title").FirstOrDefault().Nodes().Aggregate(new System.Text.StringBuilder(), (sb, node) => sb.Append(node.ToString()), sb => sb.ToString());
So I tried both HTMLEncode and HTMLDecode but neither helped. One showed the escaped chars on the screen and the other did nothing:
Title =
System.Web.HttpContext.Current.Server.HtmlDecode(
innerXml.Descendants("root").Elements("details").Nodes().Aggregate(new System.Text.StringBuilder(), (sb, node) => sb.Append(node.ToString()), sb => sb.ToString())
);
I ended up using an XmlDocument instead of an XDocument. It doesn't seem like LINQ to XML is mature enough to support what I am trying to do. There is no InnerXml property on an XDocument, only Value.
Maybe someday I will be able to revert to LINQ. For now, I just had to get this off my plate. Here is my solution:
// XmlDoc to hold custom Xml within each node
XmlDocument innerXml = new XmlDocument();
try
{
// Parse inner xml of each item and create objects
foreach (var faq in faqs)
{
innerXml.LoadXml(faq.InnerXml);
FAQ oFaq = new FAQ();
#region Fields
// Get Title value if node exists and is not null
if (innerXml.SelectSingleNode("root/title") != null)
{
oFaq.Title = innerXml.SelectSingleNode("root/title").InnerXml;
}
// Get Details value if node exists and is not null
if (innerXml.SelectSingleNode("root/details") != null)
{
oFaq.Description = innerXml.SelectSingleNode("root/details").InnerXml;
}
#endregion
result.Add(oFaq);
}
}
catch (Exception ex)
{
// Handle Exception
}
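For anyone who would rather stay with LINQ to XML: although XElement has no InnerXml property, something similar can be put together by concatenating the child nodes' markup. The extension method below is a hypothetical helper, not part of the framework, and is only a sketch of the idea.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class XElementExtensions
{
    // Hypothetical helper: concatenates the markup of all child nodes,
    // roughly what XmlNode.InnerXml returns for an XmlDocument node.
    public static string InnerXml(this XElement element)
    {
        return string.Concat(
            element.Nodes().Select(n => n.ToString(SaveOptions.DisableFormatting)));
    }
}

public static class InnerXmlDemo
{
    public static void Main()
    {
        XElement details = XElement.Parse(
            "<details>This is <strong>test copy</strong>.</details>");

        // Value flattens the content to text only.
        Console.WriteLine(details.Value);

        // InnerXml() keeps the nested markup.
        Console.WriteLine(details.InnerXml());
    }
}
```

If the result still shows up as visible tags on the page, the encoding is probably being applied by the view rather than by the XML code.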
I do think wrapping your details node in a CDATA block is the right decision. CDATA basically indicates that the information contained within it should be treated as text, and not parsed for XML special characters. The HTML characters in the details node, especially the < and >, are in direct conflict with the XML spec, and should really be marked as text.
You might be able to hack around this by grabbing the InnerXml, but if you have control over the document content, CDATA is the correct decision.
In case you need an example of how that should look, here's a modified version of the detail node:
<details>
<![CDATA[
This is <strong>test copy</strong>. This is A Link
]]>
</details>
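With the content wrapped like this, the original Value-based code returns the markup unchanged. A minimal sketch (the sample string is shortened, and the class name CDataDemo and method ReadDetails are hypothetical):

```csharp
using System;
using System.Xml.Linq;

public static class CDataDemo
{
    // Reads the text of <details>; for a CDATA section, Value returns the
    // content verbatim, markup characters included.
    public static string ReadDetails(string xml)
    {
        return XElement.Parse(xml).Element("details").Value;
    }

    public static void Main()
    {
        string xml =
            "<root><details><![CDATA[This is <strong>test copy</strong>.]]></details></root>";

        Console.WriteLine(ReadDetails(xml));
    }
}
```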
Hi,
I have for example this xml:
<books>
<book1 name="Cosmic">
<attribute value="good"/>
</book1>
</books>
How can I display it in a ListBox control line by line, so that the final result is a list box with 5 rows in this case?
At the moment I am parsing the XML using LINQ to XML like this:
foreach (XElement element in document.DescendantNodes())
{
MyListBox.Items.Add(element.ToString());
}
But the final result puts every XML node in one list box item (including child nodes).
Does anyone have any idea how I can put the XML line by line into list box items?
Thanks.
Jeff
A simple solution would use a recursive function like the following:
public void FillListBox(ListBox listBox, XElement xml)
{
listBox.Items.Add("<" + xml.Name + ">");
foreach (XNode node in xml.Nodes())
{
if (node is XElement)
// sub-tag
FillListBox(listBox, (XElement) node);
else
// piece of text
listBox.Items.Add(node.ToString());
}
listBox.Items.Add("</" + xml.Name + ">");
}
Of course, this one will print only the tag names (e.g. <book1> in your example) and not the attributes (name="Cosmic" etc.). I’m sure you can put those in yourself.
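For completeness, here is one way the attributes could be added. To keep the sketch self-contained it collects the lines in a List<string> instead of writing to a ListBox; the class name XmlLines, the two-space indentation and the self-closing form for empty elements are my own choices, not part of the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class XmlLines
{
    // Returns one string per opening tag, closing tag or text node.
    public static List<string> ToLines(XElement root)
    {
        var lines = new List<string>();
        Append(root, lines, 0);
        return lines;
    }

    private static void Append(XElement element, List<string> lines, int depth)
    {
        string indent = new string(' ', depth * 2);

        // XAttribute.ToString() yields the name="value" form.
        string attributes = string.Concat(element.Attributes().Select(a => " " + a));

        if (!element.Nodes().Any())
        {
            // No children at all: emit a single self-closing line.
            lines.Add(indent + "<" + element.Name + attributes + "/>");
            return;
        }

        lines.Add(indent + "<" + element.Name + attributes + ">");
        foreach (XNode node in element.Nodes())
        {
            XElement child = node as XElement;
            if (child != null)
                Append(child, lines, depth + 1);      // sub-tag
            else
                lines.Add(indent + "  " + node);      // piece of text
        }
        lines.Add(indent + "</" + element.Name + ">");
    }

    public static void Main()
    {
        XDocument document = XDocument.Parse(
            "<books><book1 name=\"Cosmic\"><attribute value=\"good\"/></book1></books>");

        foreach (string line in ToLines(document.Root))
        {
            Console.WriteLine(line);
        }
    }
}
```

In the list box scenario, each returned string would be passed to MyListBox.Items.Add in place of Console.WriteLine.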
If you want to display your raw XML in a list box, use a text stream to read in your data.
using(StreamReader re = File.OpenText("Somefile.XML"))
{
string input = null;
while ((input = re.ReadLine()) != null)
{
MyListBox.Items.Add(input);
}
}
Jeff, maybe it would be much easier to implement (and to read/maintain) with a simple TextReader.ReadLine()?
I don't know what you are trying to achieve, just a suggestion.
I'm new to C#, and just started using XmlElement and its SelectSingleNode method. In my XML file there's a tag that may have a value (i.e. <tag>value</tag>) or be empty (i.e. <tag></tag>). If it's empty, SelectSingleNode returns null.
I'm currently using the following code to catch the value of the tag:
XmlElement elem = ....
string s = elem.SelectSingleNode("somepath").Value;
This code obviously raises an exception for empty tags. However, for me an empty tag is a valid value, where I expect the value of my string to be "".
Wrapping each call to SelectSingleNode with try...catch seems a huge waste of code (I have many fields that may be empty), and I'm sure there's a better way to achieve this.
What is the recommended approach?
EDIT:
Following requests, a sample XML code will be:
<Elements>
<Element>
<Name>Value</Name>
<Type>Value</Type> <!-- may be empty -->
<Color>Value</Color>
</Element>
<Element>
<Name>Value</Name>
<Type>Value</Type>
<Color>Value</Color>
</Element>
</Elements>
The CS code:
XmlDocument doc = new XmlDocument();
doc.Load("name.xml");
foreach (XmlElement elem in doc.SelectNodes("Elements/Element"))
{
myvalue = elem.SelectSingleNode("Type/text()").Value;
}
Your sample code:
myvalue = elem.SelectSingleNode("Type/text()").Value;
is where the problem is. The XPath expression you've used there doesn't mean "give me text of element Type". It means "give me all child text nodes of element Type". And an empty element doesn't have any child text nodes (a text node cannot be empty in XPath document model). If you want to get text value of the node, you should use:
myvalue = elem.SelectSingleNode("Type").InnerText;
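The difference is easy to see side by side. The following sketch uses a made-up one-element document and hypothetical helper names to compare the two expressions on an empty Type element.

```csharp
using System;
using System.Xml;

public static class EmptyElementDemo
{
    // Value of the first child text node, or null when the element has none.
    public static string TextNodeValue(XmlElement elem, string childName)
    {
        XmlNode textNode = elem.SelectSingleNode(childName + "/text()");
        return textNode == null ? null : textNode.Value;
    }

    // InnerText of the child element; an empty element gives an empty string.
    public static string InnerTextOf(XmlElement elem, string childName)
    {
        return elem.SelectSingleNode(childName).InnerText;
    }

    public static void Main()
    {
        var doc = new XmlDocument();
        doc.LoadXml("<Element><Name>Value</Name><Type></Type></Element>");
        XmlElement elem = doc.DocumentElement;

        // The empty element has no text node, so the text() query finds nothing.
        Console.WriteLine(TextNodeValue(elem, "Type") == null);

        // Selecting the element itself always succeeds when the tag exists.
        Console.WriteLine(InnerTextOf(elem, "Type").Length);
    }
}
```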
The recommended approach would be to use .NET's new XML API (namely LINQ to XML).
Here is an example:
using System;
using System.Linq;
using System.Xml.Linq;
class Program
{
static void Main()
{
String xml = @"<Root><Value></Value></Root>";
var elements = XDocument.Parse(xml)
.Descendants("Value")
.Select(e => e.Value);
}
}
http://msdn.microsoft.com/en-us/library/system.xml.xmlnode.value(VS.71).aspx
Because the "value" returned depends on the NodeType, there is a chance that the node will be interpreted as a type that can return NULL.
You might be better off using:
XmlElement elem = ....
string s = elem.SelectSingleNode("somepath").InnerText;
as XmlNode.InnerText (or XmlNode.InnerXml) will return a string, including an empty string.
Maybe this will work for you:
string s = elem.SelectSingleNode("somepath") != null ? elem.SelectSingleNode("somepath").Value : "";
If you're actually bothering with the XML DOM, you could write a helper method along the lines of:
static string NodeValue(XmlNode node, string defaultValue)
{
if (node != null)
return node.Value ?? defaultValue;
return defaultValue;
}
Then you can do the following if you're not sure your node will exist:
string s = NodeValue(elem.SelectSingleNode("Type"), String.Empty);
It keeps your code readable, especially if you're doing this for multiple elements.
All that being said, SelectSingleNode(..) does not return a null value if the tag is empty. The Value property will be null, however. If you're just trying to work around that, this should do:
string s = elem.SelectSingleNode("Type").Value ?? String.Empty;
Edit: ah, you're using /text() to select the actual text node. You could just get rid of that part of the XPath, but the NodeValue method I supplied should still work (the "?? defaultValue" part is not needed in that case though).
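Putting the pieces from this thread together, a variant of the helper based on InnerText could cover both the missing-tag and the empty-tag case in one call. This is a sketch; the extension method TextOrDefault and the sample document are hypothetical.

```csharp
using System;
using System.Xml;

public static class XmlNodeExtensions
{
    // Hypothetical helper: text of the first node matching xpath,
    // or defaultValue when nothing matches.
    public static string TextOrDefault(this XmlNode node, string xpath, string defaultValue = "")
    {
        XmlNode match = node.SelectSingleNode(xpath);
        return match != null ? match.InnerText : defaultValue;
    }
}

public static class TextOrDefaultDemo
{
    public static void Main()
    {
        var doc = new XmlDocument();
        doc.LoadXml("<Element><Name>Oak</Name><Type></Type></Element>");
        XmlElement elem = doc.DocumentElement;

        Console.WriteLine("[" + elem.TextOrDefault("Name") + "]");          // tag with text
        Console.WriteLine("[" + elem.TextOrDefault("Type") + "]");          // empty tag
        Console.WriteLine("[" + elem.TextOrDefault("Color", "n/a") + "]");  // missing tag
    }
}
```

Whether a missing tag should silently fall back to a default or be reported as an error depends on the file format, so the default argument is left to the caller.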