I'd be surprised if anyone can explain this, but it'd be interesting to know if others can reproduce the weirdness I'm experiencing...
We've got a thing based on InfoPath that processes a lot of forms. Form data should conform to an XSD, but InfoPath keeps adding its own metadata in the form of so-called "my-fields". We would like to remove the my-fields, and I wrote this simple method:
string StripMyFields(string xml)
{
    var doc = new XmlDocument();
    doc.LoadXml(xml);
    var matches = doc.SelectNodes("//node()")
        .Cast<XmlNode>()
        .Where(n => n.NamespaceURI.StartsWith("http://schemas.microsoft.com/office/infopath/"));
    Dbug("Found {0} nodes to remove.", matches.Count());
    foreach (var m in matches)
        m.ParentNode.RemoveChild(m);
    return doc.OuterXml;
}
Now comes the really weird stuff! When I run this code it behaves as I expect it to, removing any nodes that are in InfoPath namespaces. However, if I comment out the call to Dbug, the code completes, but one "my-field" remains in the XML.
I've even commented out the content of the convenient Dbug method, and it still behaves this same way:
void Dbug(string s, params object[] args)
{
    //if (args.Length > 0)
    //    s = string.Format(s, args);
    //Debug.WriteLine(s);
}
Input XML:
<?xml version="1.0" encoding="UTF-8"?>
<skjema xmlns:my="http://schemas.microsoft.com/office/infopath/2003/myXSD/2008-03-03T22:25:25" xml:lang="en-us">
  <Field-1643 orid="1643">data.</Field-1643>
  <my:myFields>
    <my:field1>Al</my:field1>
    <my:group1>
      <my:group2>
        <my:field2 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">2009-01-01</my:field2>
        <Field-1611 orid="1611">More data.</Field-1611>
        <my:field3>true</my:field3>
      </my:group2>
      <my:group2>
        <my:field2>2009-01-31</my:field2>
        <my:field3>false</my:field3>
      </my:group2>
    </my:group1>
  </my:myFields>
  <Field-1612 orid="1612">Even more data.</Field-1612>
  <my:field3>Blah blah</my:field3>
</skjema>
The "my:field3" element (at the bottom, text "Blah blah") is not removed unless I invoke Dbug.
Clearly the universe is not supposed to be like this, but I would be interested to know if others are able to reproduce.
I'm using VS2012 Premium (11.0.50727.1 RTMREL) and FW 4.5.50709 on Win8 Enterprise 6.2.9200.
First things first: LINQ uses a concept known as deferred execution. This means no results are fetched until you actually materialize the query (for example, by enumerating it).
Why does that matter for your node-removal issue? Let's see what happens in your code:
SelectNodes creates an XPathNodeIterator, which is used by an XPathNavigator, which feeds data to the XmlNodeList returned by SelectNodes.
The XPathNodeIterator walks the XML document tree based on the XPath expression provided.
Cast and Where simply decide whether a node returned by the XPathNodeIterator should participate in the final result.
We arrive right before the Dbug method call. For a moment, assume it's not there. At this point, nothing has actually happened yet; all we have is an unmaterialized LINQ query.
Things change when we start iterating. All the iterators (Cast and Where have their own iterators too) start rolling. The WhereIterator asks the CastIterator for an item, which asks the XPathNodeIterator, which finally returns the first node (Field-1643). Unfortunately, this one fails the Where test, so we ask for the next one. We have more luck with my:myFields: it is a match, so we remove it.
We quickly proceed to my:field1 (again, WhereIterator → CastIterator → XPathNodeIterator), which is also removed. Stop here for a moment. Removing my:field1 detaches it from its parent, which sets its (my:field1's) siblings to null (there are no other nodes before or after the removed node).
What's the current state of things? The XPathNodeIterator knows its current element is the my:field1 node, which just got removed. Removed as in detached from its parent, but the iterator still holds a reference. Sounds great, so let's ask it for the next node. What does the XPathNodeIterator do? It checks its Current item and asks for its NextSibling (since there are no children to walk first), which is null, given we just performed the detachment. And this means the iteration is over. Job done.
As a result, by altering the collection structure during iteration, you only removed two nodes from your document (and in reality only one, as the second removed node was a child of the one already removed).
The same behavior can be observed with much simpler XML:
<Root>
  <James>Bond</James>
  <Jason>Bourne</Jason>
  <Jimmy>Keen</Jimmy>
  <Tom />
  <Bob />
</Root>
Suppose we want to get rid of the nodes starting with J, resulting in a document containing only honest men's names:
var doc = new XmlDocument();
doc.LoadXml(xml);
var matches = doc
    .SelectNodes("//node()")
    .Cast<XmlNode>()
    .Where(n => n.Name.StartsWith("J"));
foreach (var node in matches)
{
    node.ParentNode.RemoveChild(node);
}
Console.WriteLine(doc.InnerXml);
Unfortunately, Jason and Jimmy remain. James's next sibling (the one the iterator was about to return) was originally meant to be Jason, but as soon as we detached James from the tree there were no siblings left, and the iteration ended.
Now, why does it work with Dbug? The Count call materializes the query. The iterators have run, so we already have access to all the nodes we need when we start looping. The same thing happens with ToList called right after Where, or if you inspect the results while debugging (VS even notifies you that inspecting the results will enumerate the collection).
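As a minimal sketch of that fix for the James/Jason example above (assuming the same xml string), materializing the query with ToList before touching the tree lets the iterators finish before any node is detached:
var doc = new XmlDocument();
doc.LoadXml(xml);
// ToList() runs all the iterators up front, so later removals
// can no longer cut the iteration short.
var matches = doc.SelectNodes("//node()")
    .Cast<XmlNode>()
    .Where(n => n.Name.StartsWith("J"))
    .ToList();
foreach (var node in matches)
    node.ParentNode.RemoveChild(node);
Console.WriteLine(doc.InnerXml); // <Root><Tom /><Bob /></Root>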
I think this is down to the Schrödinger's cat problem: Where will not actually evaluate the query until you view or act upon its results. Meaning, until you call Count() (or any other method that fetches results) or view it in the debugger, the results don't exist. As a test, try putting it like this:
if (matches.Any())
    foreach (var m in matches)
        m.ParentNode.RemoveChild(m);
Very strange; it's only when you actually view the results while debugging that it removes the last node. Incidentally, converting the result to a List and then looping through that also works.
List<XmlNode> matches = doc.SelectNodes("//node()")
    .Cast<XmlNode>()
    .Where(n => n.NamespaceURI.StartsWith("http://schemas.microsoft.com/office/infopath/"))
    .ToList();
foreach (var m in matches)
{
    m.ParentNode.RemoveChild(m);
}
jimmy_keen's solution worked for me. I had just a simple:
// d is an XmlDocument
XmlNodeList t = d.SelectNodes(xpath);
foreach (XmlNode x in t)
{
    x.ParentNode.RemoveChild(x);
}
d.Save(outputpath);
This would remove only 3 nodes, while stepping through in debug mode would remove 1000+ nodes.
Just adding a Count call before the foreach solved the problem:
var count = t.Count;
Related
I am working on code that formats an XML file so that subfolder nodes are actually nested within their parent node. The source XML has every folder as a separate child node of the root, instead of the proper tree you would expect, with subfolders inside their parent folders. The piece of code this question is about:
// Load original XML
string sFile = "PathFile";
XmlDocument doc = new XmlDocument();
doc.Load(sFile);
var n = doc.DocumentElement.SelectNodes("//*"); // Load all nodes into nodelist n
// int nCount = n.Count; // If uncommented, the code works
foreach (XmlNode x in n)
{ /* rest of the code */ }
Now I have the code working properly, but only sometimes, even without changing anything between runs. I have narrowed it down to this: when debugging in Visual Studio, it goes wrong if I just run the code from beginning to end. If I break halfway through and take a look at the XmlNodeList n (by hovering over it with the cursor and seeing the element count), it does work.
After discovering this I added the
int nCount = n.Count;
line and now the code works when running unsupervised from start to finish.
What is happening here, and what is the correct way to address this issue? Note: doc.LoadXml does not work with this particular file.
Thank you loads,
Thomas
The short answer: Because of side-effects in the implementation of XmlNodeList.
XmlNode.SelectNodes() returns an XmlNodeList (technically, an XPathNodeList), which is a "live list" of the nodes matching the selection (in this case an XPath selection).
As you iterate the XPathNodeList or access it in other ways, it makes its way through the matching nodes, building up an internal list as needed, and returning them as needed.
So if you try to rearrange the document as you are iterating through the nodes, this can foul up the iteration and cause it to stop before you have gone through all of them. The iteration is basically chasing a moving target as the document shifts underneath it.
However, in order to return a value for the Count property, the XPathNodeList basically needs to find every matching node and count them, so it goes through the entire set of matches and places them all in the internal list.
public override int Count {
    get {
        if (! done) {
            ReadUntil(Int32.MaxValue);
        }
        return list.Count;
    }
}
I think this explains what you are seeing. When you access the Count property before making changes, it builds up the entire list of nodes, as a side-effect, so that list is still populated when you actually iterate through them.
Of course, it would not be wise to rely on this undocumented behavior.
Instead, I advise that you actually copy the contents of the XmlNodeList to a list of your own, and then iterate over that:
string sFile = "PathFile";
XmlDocument doc = new XmlDocument();
doc.Load(sFile);
var allNodes = doc.DocumentElement
.SelectNodes("//*")
.OfType<XmlNode>() // using System.Linq;
.ToList();
foreach (XmlNode x in allNodes)
{
// rest of the code
}
I have already written a basic foreach loop over an XmlNodeList, as given below.
Sample XML File (books.xml)
XmlDocument doc = new XmlDocument();
doc.Load("books.xml");
XmlNodeList xnList = doc.SelectNodes("catalog/book");
foreach (XmlNode node in xnList)
{
    Console.WriteLine(node["author"].InnerText);
}
How do I convert this loop into Parallel.ForEach ?
I've tried with this code, but it didn't work.
Parallel.ForEach(xnList, (XmlNode node) =>
{
    Console.WriteLine(node["author"].InnerText);
});
It says:
Error 2 Argument 1: cannot convert from 'System.Xml.XmlNodeList' to 'System.Collections.Generic.IEnumerable<System.Xml.XmlNode>'
XmlNodeList implements the non-generic IEnumerable. You'll need to cast it first in order to work with an IEnumerable<XmlNode>, as that is what Parallel.ForEach operates on:
Parallel.ForEach(xnList.Cast<XmlNode>(), (XmlNode node) =>
{
    Console.WriteLine(node["author"].InnerText);
});
Another tip, you can set how many parallel processes you want:
Parallel.ForEach(nodes.Cast<XmlNode>(),
    new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
    (XmlNode node) =>
    {
        string value = node.InnerText;
        // Some other task
    });
In the above, Environment.ProcessorCount refers to the number of logical cores the CPU reports to Windows. I strongly suggest staying below this number. I have a 28-core i9, and setting it to 28 pretty much locks up the machine (just leave 1-2 cores free).
Another use for this: when debugging, it is hard to track multiple threads breaking at the same location when many threads are running. Setting MaxDegreeOfParallelism to 1 makes it behave like a regular loop.
Be mindful of the impact of this. If you are interacting with shared state, remember you will have 20+ parallel calls hitting it at once. I found that adding supposedly unique Dictionary keys reported duplicates for some bizarre reason when running multithreaded; it works fine with just 1.
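Those duplicates are most likely a symptom of Dictionary<TKey, TValue> not being thread-safe rather than anything bizarre. As a rough sketch (counting authors is only an illustration, reusing books.xml from the question), a ConcurrentDictionary can be shared safely across the parallel iterations:
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;
using System.Xml;

XmlDocument doc = new XmlDocument();
doc.Load("books.xml");
XmlNodeList xnList = doc.SelectNodes("catalog/book");

// A plain Dictionary mutated from many threads can corrupt its state or
// appear to hold duplicate keys; ConcurrentDictionary handles the locking.
var authorCounts = new ConcurrentDictionary<string, int>();

Parallel.ForEach(xnList.Cast<XmlNode>(), (XmlNode node) =>
{
    string author = node["author"].InnerText;
    authorCounts.AddOrUpdate(author, 1, (key, count) => count + 1);
});

foreach (var pair in authorCounts)
    Console.WriteLine("{0}: {1}", pair.Key, pair.Value);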
I haven't been here in so long, I forgot my prior account! Anyway, I am working on parsing an XML document that comes in ugly. It is for banking statements. Each line is a <statement>all tags</statement>. What I need to do is read this file in and parse the XML document at the same time, while also formatting it to be more human-readable. Point being,
Original input looks like this:
<statement><accountHeader><fiAddress></fiAddress><accountNumber></accountNumber><startDate>20140101</startDate><endDate>20140228</endDate><statementGroup>1</statementGroup><sortOption>0</sortOption><memberBranchCode>1</memberBranchCode><memberName></memberName><jointOwner1Name></jointOwner1Name><jointOwner2Name></jointOwner2Name></summary></statement>
<statement><accountHeader><fiAddress></fiAddress><accountNumber></accountNumber><startDate>20140101</startDate><endDate>20140228</endDate><statementGroup>1</statementGroup><sortOption>0</sortOption><memberBranchCode>1</memberBranchCode><memberName></memberName><jointOwner1Name></jointOwner1Name><jointOwner2Name></jointOwner2Name></summary></statement>
<statement><accountHeader><fiAddress></fiAddress><accountNumber></accountNumber><startDate>20140101</startDate><endDate>20140228</endDate><statementGroup>1</statementGroup><sortOption>0</sortOption><memberBranchCode>1</memberBranchCode><memberName></memberName><jointOwner1Name></jointOwner1Name><jointOwner2Name></jointOwner2Name></summary></statement>
I need the final output to be as follows:
<statement>
<name></name>
<address></address>
</statement>
This is fine and dandy. I am using the following, which is very slow: with 5.1 million lines, a 254k data file, and about 60k statements, it takes around 8 minutes.
foreach (String item in lines)
{
    XElement xElement = XElement.Parse(item);
    sr.WriteLine(xElement.ToString().Trim());
}
Then, once the file is formatted, comes the part that really sucks: I need to check every single tag in the transaction elements, and if a tag that could be there is missing, I have to fill it in. Our designer software will default in prior values whenever a tag is possible but the current object does not have it; it defaults in the value of a prior one that was not null. "I know, and they swear up and down it is not a bug... ok?"
So that is also taking about 5 to 10 minutes. I need to break all this down and find a faster method for working with the initial XML. This is a preprocessing step and cannot take that long if it doesn't have to. It just seems redundant.
Is there a better way to parse the XML, or is this the best I can do? I parse the XML, write to a temp file, and then read that file back in to the output file, inserting the missing tags. Two I/O passes for one process. Yuck.
You can start by trying a modified for loop to see if this speeds it up for you:
XElement root = new XElement("Statements");
foreach (String item in lines)
{
    XElement xElement = XElement.Parse(item);
    root.Add(xElement);
}
sr.WriteLine(root.ToString().Trim());
Well, I'm not sure if this will help with memory issues. If it works, you'll get multiple xml files.
int fileCount = 1;
int count = 0;
XElement root = null; // must be initialized so the lambda below compiles
Action Save = () => root.Save(string.Format("statements{0}.xml", fileCount++));

while (count < lines.Length) // or lines.Count
{
    try
    {
        root = new XElement("Statements");
        foreach (String item in lines.Skip(count))
        {
            XElement xElement = XElement.Parse(item);
            root.Add(xElement);
            count++;
        }
        Save();
    }
    catch (OutOfMemoryException)
    {
        // Save what we have so far, drop it, and continue with the remaining lines.
        Save();
        root = null;
        GC.Collect();
    }
}
xmllint --format file-as-one-line > output.xml
I have a file that is essentially a list of XPaths, like so:
/Options/File[1]/Settings[1]/Type[1]
/Options/File[1]/Settings[1]/Path[1]
/Options/File[1]/Settings[2]/Type[1]
/Options/File[1]/Settings[2]/Path[1]
/Options/File[2]/Settings[1]/Type[1]
/Options/File[2]/Settings[1]/Path[1]
I need to grab the values from the elements pointed to by these XPaths in a moderate-sized XML file (~3-5 MB). Using XPathSelectElement works well, but is extremely slow. Is there a quicker way to do the same with LINQ to XML, or even by manually traversing the XML?
In a related question, are the index value in the XPath and the order of elements returned from an XElement guaranteed to be the same? For instance, will these return the same element:
xdoc.XPathSelectElement("/Options/File[1]/Settings[2]");
xdoc.Root.Elements("File").ElementAt(0).Elements("Settings").ElementAt(1);
Indexed XPath (n-th child) is normally slow due to the need to traverse all children up to the one you need. To check, for a relatively large file, try picking the first child and the last child and compare the difference (repeat ~1000 times for each and use Stopwatch to measure), as in the sketch below.
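A rough sketch of that timing experiment (the file name is a placeholder; XPathSelectElement comes from System.Xml.XPath):
using System;
using System.Diagnostics;
using System.Xml.Linq;
using System.Xml.XPath;

XDocument xdoc = XDocument.Load("options.xml"); // placeholder file name

// Time 1000 selections of the first child vs. the last child.
Stopwatch sw = Stopwatch.StartNew();
for (int i = 0; i < 1000; i++)
    xdoc.XPathSelectElement("/Options/File[1]");
sw.Stop();
Console.WriteLine("First child: {0} ms", sw.ElapsedMilliseconds);

sw.Restart();
for (int i = 0; i < 1000; i++)
    xdoc.XPathSelectElement("/Options/File[last()]");
sw.Stop();
Console.WriteLine("Last child: {0} ms", sw.ElapsedMilliseconds);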
If your XPaths are all like the ones you've shown, you may be able to do the selection manually, caching the child nodes as you iterate.
The order of elements in XML is significant, so a normal XML API will always preserve element order. Note that the order of attributes is not significant in XML, so attribute order may not be the same across queries (unlikely, but theoretically possible) or across different APIs.
I just had a similar problem: horrible performance selecting some nodes in a medium-sized XML file (3 MB), using a bunch of indexed XPath expressions.
In contrast to your case, though, I didn't have an index in every part of the XPath expression. So I ditched LINQ to XML's XPath support (XElement.XPathSelectElement) and instead used an XPathNavigator, created via an XPathDocument and CreateNavigator(). On the navigator I used SelectSingleNode.
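A minimal sketch of that approach (the file name is a placeholder; the XPath would come from your own list):
using System;
using System.Xml.XPath;

// XPathDocument is a read-only, XPath-optimized store; queries through its
// navigator ran much faster than XElement.XPathSelectElement in my case.
XPathDocument document = new XPathDocument("options.xml"); // placeholder file name
XPathNavigator navigator = document.CreateNavigator();

XPathNavigator result = navigator.SelectSingleNode("/Options/File[1]/Settings[1]/Type[1]");
if (result != null)
    Console.WriteLine(result.Value);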
Using XElement.XPathSelectElement, it took 137.3 seconds to do all the selects (the rest of the program only took about 3 seconds, by the way).
Using XPathNavigator.SelectSingleNode, the selects now take 1.2 seconds in total... that's a factor of almost 115.
So if anyone needs faster XPath queries and doesn't want to parse the queries themselves: avoid LINQ to XML's XPath support if possible; it seems to be implemented horribly, performance-wise.
I think this is what I am going to go with. I am certain there could be more performance improvements, such as Alexei's suggestion, but this is already at least 10 times faster in my limited tests.
private XElement GetElementFromXPath(XDocument xDoc, string xPath)
{
    // Split the XPath into its steps, e.g. "Options", "File[1]", "Settings[2]".
    string[] nodes = xPath.Split(new char[] { '/' }, StringSplitOptions.RemoveEmptyEntries);
    XContainer xe = xDoc.Root;
    // Skip the root (i = 0) and walk down one indexed step at a time.
    for (int i = 1; i < nodes.Length; i++)
    {
        string[] chunks = nodes[i].Split(new char[] { '[', ']' });
        int index = 0;
        if (Int32.TryParse(chunks[1], out index))
            xe = xe.Elements(chunks[0]).ElementAt(index - 1); // XPath indexes are 1-based
    }
    return (XElement)xe;
}
This assumes that all elements other than the root are listed along with their index number in the XPath (which is true for my scenarios).
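Hypothetical usage, assuming the XPaths sit one per line in a file called xpaths.txt (requires using System.IO and System.Xml.Linq; file names are placeholders):
XDocument xDoc = XDocument.Load("options.xml"); // placeholder file name
foreach (string xPath in File.ReadLines("xpaths.txt"))
{
    XElement element = GetElementFromXPath(xDoc, xPath);
    Console.WriteLine("{0} = {1}", xPath, element.Value);
}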
I'm doing some processing on a TreeView. I don't use either a stack or a queue to process the nodes; I just do this:
void somemethod(TreeNode root)
{
    foreach (TreeNode item in root.Nodes)
    {
        // do something on item
        somemethod(item);
    }
}
I'm a little blocked right now (can't think with clarity) and I can't see what kind of tree traversal I'm doing. Is this BFS or DFS, or neither of them?
My guess was DFS, but I wasn't sure. The CLR doesn't do anything weird like processing two siblings before descending, to take advantage of multiprocessing, does it? That weird thought came to mind and clouded my judgment.
You are doing a DFS (depth-first search/traversal) right now, using recursion.
It's depth-first because recursion works the same way a stack would: you process the children of the current node before you process the next sibling, so you go for depth first instead of breadth.
Edit:
In response to your comment / updated question: your code will be processed sequentially, item by item; there will be no parallel processing, no "magic" involved. The traversal using recursion is equivalent to using a stack (LIFO = last in, first out); it is just implicit. So your method could also have been written like the following, which produces the same order of traversal:
public void SomeMethod(TreeNode root)
{
    Stack<TreeNode> nodeStack = new Stack<TreeNode>();
    nodeStack.Push(root);
    while (nodeStack.Count > 0)
    {
        TreeNode node = nodeStack.Pop();
        // do something on item

        // Need to push children in reverse order, so the first child is pushed last.
        // TreeNodeCollection is non-generic, so Cast<TreeNode>() (using System.Linq) is required.
        foreach (TreeNode item in node.Nodes.Cast<TreeNode>().Reverse())
            nodeStack.Push(item);
    }
}
I hope this makes it clearer what is going on - it might be useful for you to write out the nodes to the console as they are being processed or actually walk through step by step with a debugger.
(Also, both the recursive method and the one using a stack assume there is no cycle and don't test for that; the assumption is that this is a tree and not an arbitrary graph. For the latter, DFS introduces a visited flag to mark nodes already seen.)
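A minimal sketch of that graph variant, using a hypothetical GraphNode type rather than the WinForms TreeNode:
using System;
using System.Collections.Generic;

// Hypothetical node type for an arbitrary graph (not the WinForms TreeNode).
public class GraphNode
{
    public string Name { get; set; }
    public List<GraphNode> Neighbors { get; } = new List<GraphNode>();
}

public static class GraphTraversal
{
    // Iterative DFS over a general graph: the HashSet marks visited nodes,
    // so cycles and shared neighbors are processed only once.
    public static void DepthFirst(GraphNode start, Action<GraphNode> process)
    {
        var visited = new HashSet<GraphNode>();
        var stack = new Stack<GraphNode>();
        stack.Push(start);
        while (stack.Count > 0)
        {
            GraphNode node = stack.Pop();
            if (!visited.Add(node)) // Add returns false when the node was already seen
                continue;
            process(node);
            // Push neighbors in reverse order so the first neighbor is processed first.
            for (int i = node.Neighbors.Count - 1; i >= 0; i--)
                stack.Push(node.Neighbors[i]);
        }
    }
}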
I'm pretty sure your example corresponds to depth-first search, because the nodes on which you "do something" increase in depth before breadth.