How would you compare two XML Documents? - c#

As part of the base class for some extensive unit testing, I am writing a helper function which recursively compares the nodes of one XmlDocument object to another in C# (.NET). Some requirements of this:
The first document is the source, e.g. what I want the XML document to look like. Thus the second is the one I want to find differences in and it must not contain extra nodes not in the first document.
Must throw an exception when too many significant differences are found, and it should be easily understood by a human glancing at the description.
Child element order is important, attributes can be in any order.
Some attributes are ignorable; specifically xsi:schemaLocation and xmlns:xsi, though I would like to be able to pass in which ones are.
Prefixes for namespaces must match in both attributes and elements.
Whitespace between elements is irrelevant.
Elements will either have child elements or InnerText, but not both.
While I'm scraping something together: has anyone written such code, and would it be possible to share it here?
On an aside, what would you call the first and second documents? I've been referring to them as "source" and "target", but it feels wrong since the source is what I want the target to look like, else I throw an exception.

Microsoft has an XML diff API that you can use.
Unofficial NuGet: https://www.nuget.org/packages/XMLDiffPatch.

I googled up a more complete list of solutions to this problem today; I am going to try one of them soon:
http://xmlunit.sourceforge.net/
http://msdn.microsoft.com/en-us/library/aa302294.aspx
http://jolt.codeplex.com/wikipage?title=Jolt.Testing.Assertions.XML.Adaptors
http://www.codethinked.com/checking-xml-for-semantic-equivalence-in-c
https://vkreynin.wordpress.com/tag/xml/
http://gandrusz.blogspot.com/2008/07/recently-i-have-run-into-usual-problem.html
http://xmlspecificationcompare.codeplex.com/
https://github.com/netbike/netbike.xmlunit

This code doesn't satisfy all your requirements, but it's simple and I'm using it for my unit tests. Attribute order doesn't matter, but element order does. Element inner text is not compared. I also ignored case when comparing attribute values, but you can easily remove that.
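// Requires: using System; using System.Linq; using System.Xml.Linq;
public bool XMLCompare(XElement primary, XElement secondary)
{
    // Element names must match, otherwise element order would not really be checked.
    if (primary.Name != secondary.Name)
        return false;
    // Compare attribute counts unconditionally so extra attributes on secondary are caught.
    if (primary.Attributes().Count() != secondary.Attributes().Count())
        return false;
    foreach (XAttribute attr in primary.Attributes()) {
        // Look up by the full XName so namespaced attributes are matched correctly.
        XAttribute other = secondary.Attribute(attr.Name);
        if (other == null)
            return false;
        // Case-insensitive value comparison; switch to StringComparison.Ordinal for an exact match.
        if (!string.Equals(attr.Value, other.Value, StringComparison.OrdinalIgnoreCase))
            return false;
    }
    // Materialize the children once instead of re-enumerating them with Skip/Take.
    var primaryChildren = primary.Elements().ToList();
    var secondaryChildren = secondary.Elements().ToList();
    if (primaryChildren.Count != secondaryChildren.Count)
        return false;
    for (var i = 0; i < primaryChildren.Count; i++) {
        if (!XMLCompare(primaryChildren[i], secondaryChildren[i]))
            return false;
    }
    return true;
}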

Try XMLUnit. This library is available for both Java and .NET.

For comparing two XML outputs in automated testing I found XNode.DeepEquals.
Compares the values of two nodes, including the values of all descendant nodes.
Usage:
var xDoc1 = XDocument.Parse(xmlString1);
var xDoc2 = XDocument.Parse(xmlString2);
bool isSame = XNode.DeepEquals(xDoc1.Document, xDoc2.Document);
//Assert.IsTrue(isSame);
Reference: https://learn.microsoft.com/en-us/dotnet/api/system.xml.linq.xnode.deepequals?view=netcore-2.2
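A caveat when applying this to the question's requirements: XNode.DeepEquals appears to compare attributes in document order, and it has no notion of ignorable attributes. The following is a minimal sketch, not a definitive implementation, that normalizes both trees before calling DeepEquals. The class XmlNormalizer and its methods Normalize and AreEquivalent are hypothetical names introduced here for illustration; only the LINQ to XML calls are real API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class XmlNormalizer
{
    // Hypothetical helper (not part of LINQ to XML): returns a copy of the
    // element with ignored attributes dropped and the rest sorted by name.
    public static XElement Normalize(XElement element, ISet<XName> ignoredAttributes)
    {
        var attributes = element.Attributes()
            .Where(a => !ignoredAttributes.Contains(a.Name))
            .OrderBy(a => a.Name.NamespaceName, StringComparer.Ordinal)
            .ThenBy(a => a.Name.LocalName, StringComparer.Ordinal);

        // Recurse into child elements; other nodes (text, comments) are copied unchanged.
        var children = element.Nodes()
            .Select(n => n is XElement child ? (object)Normalize(child, ignoredAttributes) : n);

        return new XElement(element.Name, attributes, children);
    }

    public static bool AreEquivalent(string expectedXml, string actualXml, ISet<XName> ignoredAttributes)
    {
        // XElement.Parse discards insignificant whitespace by default.
        var expected = Normalize(XElement.Parse(expectedXml), ignoredAttributes);
        var actual = Normalize(XElement.Parse(actualXml), ignoredAttributes);
        return XNode.DeepEquals(expected, actual);
    }

    public static void Main()
    {
        XNamespace xsi = "http://www.w3.org/2001/XMLSchema-instance";
        var ignored = new HashSet<XName> { xsi + "schemaLocation", XNamespace.Xmlns + "xsi" };
        Console.WriteLine(AreEquivalent(
            "<r b='2' a='1'><c/></r>",
            "<r a='1' b='2'>\n  <c/>\n</r>",
            ignored));
    }
}
```

Child element order still matters with this approach, because only attributes are sorted. It reports a yes/no answer only; producing a human-readable description of the differences would need a custom walk over the two trees.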

Comparing XML documents is complicated. Google for xmldiff (there's even a Microsoft solution) for some tools. I've solved this a couple of ways. I used XSLT to sort elements and attributes (because sometimes they would appear in a different order, and I didn't care about that) and to filter out attributes I didn't want to compare. Then I either used the XML::Diff or XML::SemanticDiff Perl module, or pretty-printed each document with every element and attribute on a separate line and ran the Unix command-line diff on the results.

https://github.com/CameronWills/FatAntelope
Another alternative library to the Microsoft XML Diff API. It has an XML diffing algorithm that does an unordered comparison of two XML documents and produces an optimal matching.
It is a C# port of the X-Diff algorithm described here:
http://pages.cs.wisc.edu/~yuanwang/xdiff.html
Disclaimer: I wrote it :)

Another way to do this would be -
Get the contents of both files into two different strings.
Transform the strings using an XSLT (which will just copy everything over to two new strings). This will ensure that all spaces outside the elements are removed. This will result in two new strings.
Now, just compare the two strings with each other.
This won't give you the exact location of the difference, but if you just want to know if there is a difference, this is easy to do without any third party libraries.
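The steps above can be sketched with the built-in XslCompiledTransform class. This is a sketch under assumptions, not the answerer's actual code: the stylesheet is a plain identity transform with xsl:strip-space, and XsltNormalizer, Normalize and SameAfterNormalization are hypothetical names.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class XsltNormalizer
{
    // Identity stylesheet: copies every node and strips whitespace-only text nodes.
    private const string IdentityStylesheet = @"<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output method='xml' indent='no' omit-xml-declaration='yes'/>
  <xsl:strip-space elements='*'/>
  <xsl:template match='@*|node()'>
    <xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>
  </xsl:template>
</xsl:stylesheet>";

    public static string Normalize(string xml)
    {
        var transform = new XslCompiledTransform();
        using (var stylesheetReader = XmlReader.Create(new StringReader(IdentityStylesheet)))
        {
            transform.Load(stylesheetReader);
        }
        using (var input = XmlReader.Create(new StringReader(xml)))
        using (var output = new StringWriter())
        {
            transform.Transform(input, null, output);
            return output.ToString();
        }
    }

    // Compares the two documents after both have been run through the identity transform.
    public static bool SameAfterNormalization(string xml1, string xml2)
    {
        return string.Equals(Normalize(xml1), Normalize(xml2), StringComparison.Ordinal);
    }

    public static void Main()
    {
        Console.WriteLine(SameAfterNormalization("<r><c>1</c></r>", "<r>\n  <c>1</c>\n</r>"));
    }
}
```

Attribute order is copied through unchanged, so two documents that differ only in attribute order would still compare as different unless the stylesheet also sorts attributes.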

I am using ExamXML for comparing XML files. You can try it.
The authors, A7Soft, also provide an API for comparing XML files.

Not relevant for the OP since it currently ignores child order, but if you want a code only solution you can try XmlSpecificationCompare which I somewhat misguidedly developed.

All of the above answers are helpful, but I tried XMLUnit, which looks like an easy-to-use NuGet package for checking the difference between two XML files. Here is some C# sample code:
public static bool CheckXMLDifference(string xmlInput, string xmlOutput)
{
    Diff myDiff = DiffBuilder.Compare(Input.FromString(xmlInput))
        .WithTest(Input.FromString(xmlOutput))
        .CheckForSimilar().CheckForIdentical()
        .IgnoreComments()
        .IgnoreWhitespace().NormalizeWhitespace().Build();
    // True when there is no difference (the files are identical),
    // false when there are one or more differences.
    return !myDiff.Differences.Any();
}
If anyone wants to test it, I have also created an online tool using it. You can take a look here:
https://www.minify-beautify.com/online-xml-difference

Based on @Two Cents' answer and this XMLSorting link, I have created my own XmlComparer.
Compare XML program
private static bool compareXML(XmlNode node, XmlNode comparenode)
{
    if (node.Value != comparenode.Value)
        return false;
    // Attributes is null for non-element nodes (e.g. text nodes), so guard against it.
    if (node.Attributes != null && node.Attributes.Count > 0)
    {
        foreach (XmlAttribute parentnodeattribute in node.Attributes)
        {
            // A missing attribute on the other side counts as a difference.
            XmlAttribute compareattribute = comparenode.Attributes?[parentnodeattribute.Name];
            if (compareattribute == null || parentnodeattribute.Value != compareattribute.Value)
                return false;
        }
    }
    if (node.HasChildNodes)
    {
        // Sort both sides so that sibling order does not affect the result.
        sortXML(node);
        sortXML(comparenode);
        if (node.ChildNodes.Count != comparenode.ChildNodes.Count)
            return false;
        for (int i = 0; i < node.ChildNodes.Count; i++)
        {
            if (!compareXML(node.ChildNodes[i], comparenode.ChildNodes[i]))
                return false;
        }
    }
    return true;
}
Sort XML program
private static void sortXML(XmlNode documentElement)
{
    SortAttributes(documentElement.Attributes);
    SortElements(documentElement);
    foreach (XmlNode childNode in documentElement.ChildNodes)
    {
        sortXML(childNode);
    }
}
private static void SortElements(XmlNode rootNode)
{
    // Bubble sort of the child nodes by name.
    for (int j = 0; j < rootNode.ChildNodes.Count; j++)
    {
        for (int i = 1; i < rootNode.ChildNodes.Count; i++)
        {
            // Compare each node with its predecessor (i - 1) and swap if out of order.
            if (String.Compare(rootNode.ChildNodes[i].Name, rootNode.ChildNodes[i - 1].Name) < 0)
            {
                rootNode.InsertBefore(rootNode.ChildNodes[i], rootNode.ChildNodes[i - 1]);
            }
        }
    }
}
private static void SortAttributes(XmlAttributeCollection attribCol)
{
if (attribCol == null)
return;
bool changed = true;
while (changed)
{
changed = false;
for (int i = 1; i < attribCol.Count; i++)
{
if (String.Compare(attribCol[i].Name, attribCol[i - 1].Name) < 0)
{
//Replace
attribCol.InsertBefore(attribCol[i], attribCol[i - 1]);
changed = true;
}
}
}
}

I solved this problem of xml comparison using XSLT 1.0 which can be used for comparing large xml files using an unordered tree comparison algorithm.
https://github.com/sflynn1812/xslt-diff-turbo

Related

How to efficiently find the index of value in a System.Numerics.Vector<T>?

I am exploring System.Numerics.Vector with .NET Framework 4.7.2 (the project I am working on cannot be migrated to .NET Core 3 and use the new Intrinsics namespace yet). The project processes very large CSV/TSV files, and we spend a lot of time looping through strings to find commas, quotes, etc., so I am trying to speed up the process.
So far, I have been able to use Vector to identify whether a string contains a given character (using the EqualsAny method). That’s great, but I want to go a little further. I want to efficiently find the index of that character using Vector. I do not know how. Below is the function I use to determine whether a string contains a comma.
// Lives in a static class named StringExtensions.
// Requires: using System; using System.Numerics; using System.Runtime.InteropServices;
// A vector with every lane set to ',' so that EqualsAny can test all lanes at once.
private static readonly Vector<ushort> Commas = new Vector<ushort>((ushort)',');
public static bool HasCommas(this string s)
{
    if (s == null)
    {
        return false;
    }
    ReadOnlySpan<char> charSpan = s.AsSpan();
    ReadOnlySpan<Vector<ushort>> charAsVectors = MemoryMarshal.Cast<char, Vector<ushort>>(charSpan);
    foreach (Vector<ushort> v in charAsVectors)
    {
        bool foundCommas = Vector.EqualsAny(v, StringExtensions.Commas);
        if (foundCommas)
        {
            return true;
        }
    }
    // Scan the tail that did not fill a whole vector.
    int numberOfCharactersProcessedSoFar = charAsVectors.Length * Vector<ushort>.Count;
    for (int i = numberOfCharactersProcessedSoFar; i < s.Length; i++)
    {
        if (s[i] == ',')
        {
            return true;
        }
    }
    return false;
}
I understand that I could use the function above and scan the resulting Vector, but it would defeat the purpose of using a Vector. I heard about the new Intrinsics library that could help, but I cannot upgrade my project to .NET Core 3.
Given a Vector, how would you efficiently find the position of a character? Is there a clever trick that I am not aware of?

Using LINQ in a string array to improve efficient C#

I have an equation string, and when I split it with my pattern I get the following string array.
string[] equationList = {"code1","+","code2","-","code3"};
Then from this I create a list which only contains the codes.
List<string> codeList = new List<string> {"code1","code2","code3"};
The existing code then loops through codeList, retrieves the value of each code, and replaces the matching entry in equationList using the code below.
foreach (var code in codeList ){
var codeVal = GetCodeValue(code);
for (var i = 0; i < equationList.Length; i++){
if (!equationList[i].Equals(code,StringComparison.InvariantCultureIgnoreCase)) continue;
equationList[i] = codeVal;
break;
}
}
I am trying to improve the efficiency, and I believe I can get rid of the for loop within the foreach by using LINQ.
My question is: would doing so actually speed up the process?
If yes then can you please help with the linq statement?
Before jumping to LINQ... which doesn't solve any problems you've described, let's look at the logic you have here.
We split a string with a 'pattern'. How?
We then create a new list of codes. How?
We then loop through those codes and decode them. How?
But since we forgot to keep track of where those codes came from, we now loop through the equationList (which is an array, not a List<T>) to substitute the results.
Seems a little convoluted to me.
Maybe a simpler solution would be:
Take in a string, and return IEnumerable<string> of words (similar to what you do now).
Take in a IEnumerable<string> of words, and return a IEnumerable<?> of values.
That is to say with this second step iterate over the strings, and simply return the value you want to return - rather than trying to extract certain values out, parsing them, and then inserting them back into a collection.
//Ideally we return something more specific eg, IEnumerable<Tokens>
public IEnumerable<string> ParseEquation(IEnumerable<string> words)
{
foreach (var word in words)
{
if (IsOperator(word)) yield return ToOperator(word);
else if (IsCode(word)) yield return ToCode(word);
else ...;
}
}
This is quite similar to the LINQ Select Statement... if one insisted I would suggest writing something like so:
var tokens = equationList.Select(ToToken);
...
public Token ToToken(string word)
{
if (IsOperator(word)) return ToOperator(word);
else if (IsCode(word)) return ToCode(word);
else ...;
}
If GetCodeValue(code) doesn't already, I suggest it probably could use some sort of caching/dictionary in its implementation - though the specifics dictate this.
The benefits of this approach is that it is flexible (we can easily add more processing steps), simple to follow (we put in these values and get these as a result, no mutating state) and easy to write. It also breaks the problem down into nice little chunks that solve their own task, which will help immensely when trying to refactor, or find niggly bugs/performance issues.
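As a concrete illustration of this one-pass approach, here is a minimal sketch. It is not the asker's code: EquationParser, IsOperator, ToValue and the CodeValues dictionary are hypothetical, with the dictionary standing in for whatever GetCodeValue really does.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class EquationParser
{
    // Hypothetical stand-in for the asker's GetCodeValue lookup.
    private static readonly Dictionary<string, string> CodeValues =
        new Dictionary<string, string>(StringComparer.InvariantCultureIgnoreCase)
        {
            { "code1", "10" }, { "code2", "20" }, { "code3", "30" },
        };

    private static bool IsOperator(string word)
    {
        return word == "+" || word == "-" || word == "*" || word == "/";
    }

    // Maps one word to its replacement: operators pass through, codes are looked up.
    public static string ToValue(string word)
    {
        if (IsOperator(word)) return word;
        string value;
        if (CodeValues.TryGetValue(word, out value)) return value;
        throw new ArgumentException("Unknown token: " + word, nameof(word));
    }

    // One pass over the words, no second list and no nested search.
    public static string[] Evaluate(IEnumerable<string> words)
    {
        return words.Select(w => ToValue(w)).ToArray();
    }

    public static void Main()
    {
        string[] equationList = { "code1", "+", "code2", "-", "code3" };
        Console.WriteLine(string.Join(" ", Evaluate(equationList))); // prints "10 + 20 - 30"
    }
}
```

Each word is handled exactly once, so the work grows linearly with the length of the equation instead of requiring a search of equationList for every code.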
If your array always alternates code then operator, this LINQ should do what you want:
string[] equationList = { "code1", "+", "code2", "-", "code3" };
var processedList = equationList.Select((s,j) => (j % 2 == 1) ? s :GetCodeValue(s)).ToArray();
You will need to check if it is faster
I think the fastest solution will be this:
var codeCache = new Dictionary<string, string>();
for (var i = equationList.Length - 1; i >= 0; --i)
{
var item = equationList[i];
if (! < item is valid >) // you know this because you created the codeList
continue;
string codeVal;
if (!codeCache.TryGetValue(item, out codeVal))
{
codeVal = GetCodeValue(item);
codeCache.Add(item, codeVal);
}
equationList[i] = codeVal;
}
You don't need a codeList. If every code is unique you can remove the codeCache.

is String.Contains() faster than walking through whole array of char in string?

I have a function that is walking through the string looking for pattern and changing parts of it. I could optimize it by inserting
if (!text.Contains(pattern)) return;
But I am actually walking through the whole string and comparing parts of it with the pattern, so the question is: how does String.Contains() actually work? I know there was such a question (How does String.Contains work?), but the answer is rather unclear. So, if String.Contains() also walks through the whole array of chars and compares them to the pattern I am looking for, it wouldn't really make my function faster, but slower.
So, is it a good idea to attempt such an optimization? And is it possible for String.Contains() to be even faster than a function that just walks through the whole array and compares every single character with a constant one?
Here is the code:
public static char colorchar = (char)3;
public static Client.RichTBox.ContentText color(string text, Client.RichTBox SBAB)
{
if (text.Contains(colorchar.ToString()))
{
int color = 0;
bool closed = false;
int position = 0;
while (text.Length > position)
{
if (text[position] == colorchar)
{
if (closed)
{
text = text.Substring(position, text.Length - position);
Client.RichTBox.ContentText Link = new Client.RichTBox.ContentText(ProtocolIrc.decode_text(text), SBAB, Configuration.CurrentSkin.mrcl[color]);
return Link;
}
if (!closed)
{
if (!int.TryParse(text[position + 1].ToString() + text[position + 2].ToString(), out color))
{
if (!int.TryParse(text[position + 1].ToString(), out color))
{
color = 0;
}
}
if (color > 9)
{
text = text.Remove(position, 3);
}
else
{
text = text.Remove(position, 2);
}
closed = true;
if (color < 16)
{
text = text.Substring(position);
break;
}
}
}
position++;
}
}
return null;
}
Short answer is that your optimization is no optimization at all.
Basically, String.Contains(...) just returns String.IndexOf(..) >= 0
You could improve your algorithm to:
int position = text.IndexOf(colorchar.ToString()...);
if (-1 < position)
{ /* Do it */ }
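A small runnable sketch of that idea: a single IndexOf call both tests for the marker and yields the position to start from, so the string is not scanned twice. MarkerSearch and TextAfterMarker are hypothetical names used only for illustration.

```csharp
using System;

public static class MarkerSearch
{
    // Returns the text after the first marker, or null when the marker is absent.
    public static string TextAfterMarker(string text, char marker)
    {
        int position = text.IndexOf(marker); // one scan: tests presence and finds the index
        if (position < 0) return null;
        return text.Substring(position + 1);
    }

    public static void Main()
    {
        char colorChar = (char)3;
        Console.WriteLine(TextAfterMarker("plain" + colorChar + "04red", colorChar)); // prints "04red"
    }
}
```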
Yes.
And doesn't have a bug (ahhm...).
There are better ways of looking for multiple substrings in very long texts, but for most common usages String.Contains (or IndexOf) is the best.
Also IIRC the source of String.Contains is available in the .Net shared sources
Oh, and if you want a performance comparison you can just measure for your exact use-case
Check this similar post How does string.contains work
I think that you will not be able to simply do anything faster than String.Contains, unless you want to use standard CRT function wcsstr, available in msvcrt.dll, which is not so easy
Unless you have profiled your application and determined that the line with String.Contains is a bottle-neck, you should not do any such premature optimizations. It is way more important to keep your code's intention clear.
And while there are many ways to implement the methods in the .NET base classes, you should assume the default implementations are optimal enough for most people's use cases. For example, any (future) implementation of .NET might use the x86-specific instructions for string comparisons. That would then always be faster than what you can do in C#.
If you really want to be sure whether your custom string comparison code is faster than String.Contains, you need to measure them both using many iterations, each with a different string. For example using the Stopwatch class to measure the time.
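A rough sketch of such a measurement follows. The names ContainsBenchmark, ManualContains and Measure are hypothetical, and the inputs are placeholders; real measurements should use strings representative of your data and a Release build.

```csharp
using System;
using System.Diagnostics;

public static class ContainsBenchmark
{
    // Hand-written scan, standing in for the "walk the whole string" approach.
    public static bool ManualContains(string text, char c)
    {
        for (int i = 0; i < text.Length; i++)
        {
            if (text[i] == c) return true;
        }
        return false;
    }

    // Runs the search over all inputs, repeated `iterations` times, and reports
    // the elapsed time. The hit count is returned so the work cannot be skipped.
    public static TimeSpan Measure(Func<string, bool> search, string[] inputs, int iterations, out int hits)
    {
        hits = 0;
        var stopwatch = Stopwatch.StartNew();
        for (int n = 0; n < iterations; n++)
        {
            foreach (string input in inputs)
            {
                if (search(input)) hits++;
            }
        }
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }

    public static void Main()
    {
        string[] inputs = { "a,b,c", new string('x', 10000), new string('x', 10000) + "," };
        int hits;
        TimeSpan builtIn = Measure(s => s.Contains(","), inputs, 100000, out hits);
        TimeSpan manual = Measure(s => ManualContains(s, ','), inputs, 100000, out hits);
        Console.WriteLine("String.Contains: " + builtIn.TotalMilliseconds + " ms");
        Console.WriteLine("Manual loop:     " + manual.TotalMilliseconds + " ms");
    }
}
```

Running each candidate once before timing it helps keep JIT compilation out of the numbers.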
If you know details that you can exploit for optimization (beyond a simple contains check), then sure, you can make your method faster than string.Contains; otherwise, no.

How to count the number of code lines in a C# solution, without comments and empty lines, and other redundant stuff, etc?

By redundant stuff, I mean the namespaces, as I know they are necessary but if there are 10k of them, it doesn't add valuable info to the table.
Could this be done using Linq?
Visual Studio will do this for you. Right-click on your project and choose Calculate Code Metrics.
No need to reinvent the wheel. Take a look at the Visual Studio Code Metrics PowerTool 11.0
Overview
The Code Metrics PowerTool is a command line utility that calculates code metrics for your managed code and saves them to an XML file. This tool enables teams to collect and report code metrics as part of their build process. The code metrics calculated are:
• Maintainability Index
• Cyclomatic Complexity
• Depth of Inheritance
• Class Coupling
• Lines Of Code (LOC)
I know you said you don't have Ultimate, so I just wanted to show you what you're missing.
For everyone else, there's SourceMonitor
From: http://rajputyh.blogspot.in/2014/02/counting-number-of-real-lines-in-your-c.html
private int CountNumberOfLinesInCSFilesOfDirectory(string dirPath)
{
FileInfo[] csFiles = new DirectoryInfo(dirPath.Trim())
.GetFiles("*.cs", SearchOption.AllDirectories);
int totalNumberOfLines = 0;
Parallel.ForEach(csFiles, fo =>
{
Interlocked.Add(ref totalNumberOfLines, CountNumberOfLine(fo));
});
return totalNumberOfLines;
}
private int CountNumberOfLine(Object tc)
{
FileInfo fo = (FileInfo)tc;
int count = 0;
int inComment = 0;
using (StreamReader sr = fo.OpenText())
{
string line;
while ((line = sr.ReadLine()) != null)
{
if (IsRealCode(line.Trim(), ref inComment))
count++;
}
}
return count;
}
private bool IsRealCode(string trimmed, ref int inComment)
{
if (trimmed.StartsWith("/*") && trimmed.EndsWith("*/"))
return false;
else if (trimmed.StartsWith("/*"))
{
inComment++;
return false;
}
else if (trimmed.EndsWith("*/"))
{
inComment--;
return false;
}
return
inComment == 0
&& !trimmed.StartsWith("//")
&& (trimmed.StartsWith("if")
|| trimmed.StartsWith("else if")
|| trimmed.StartsWith("using (")
|| trimmed.Contains(";")
|| trimmed.StartsWith("public") //method signature
|| trimmed.StartsWith("private") //method signature
|| trimmed.StartsWith("protected") //method signature
);
}
Comments of // and /* kind are ignored.
A statement written across multiple lines is counted as a single line.
Brackets (i.e. '{') are not counted as lines.
'using namespace' lines are ignored.
Lines which are class name etc. are ignored.
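On the question of whether this could be done with LINQ: a simplified counter can be written as a single query. The sketch below is a rough heuristic under stated assumptions, not a port of the code above: LineCounter and CountCodeLines are hypothetical names, block comments are not handled, and a line is treated as a using directive when it starts with "using " and contains no parenthesis.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LineCounter
{
    // Counts lines that are not blank, not // comments, not lone braces,
    // and not using directives or namespace declarations.
    public static int CountCodeLines(IEnumerable<string> lines)
    {
        return lines
            .Select(line => line.Trim())
            .Count(line => line.Length > 0
                && !line.StartsWith("//", StringComparison.Ordinal)
                && line != "{" && line != "}"
                && !line.StartsWith("namespace ", StringComparison.Ordinal)
                && !(line.StartsWith("using ", StringComparison.Ordinal) && !line.Contains("(")));
    }

    public static void Main()
    {
        string[] sample =
        {
            "using System;", "", "namespace Demo", "{", "// a comment",
            "int x = 1;", "using (var r = Open())", "}",
        };
        Console.WriteLine(CountCodeLines(sample)); // prints 2
    }
}
```

To run it over a solution directory, the lines could be supplied with Directory.EnumerateFiles(dir, "*.cs", SearchOption.AllDirectories).SelectMany(File.ReadLines).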
I have no solid idea about them, but you can use Code Metrics Values to get some statistics about your solution, like code lines.
We have used the TFS cube to get data about how many lines were added, deleted, or changed on our TFS. You can view it from Excel, but it needs to be configured properly, and I don't think it will exclude comments, blank lines, etc.
Ctrl+Shift+f (Find in files) -> put ";" in the "Find what:"-textbox -> Press "Find All"-button.
This extremely simple method makes use of the fact that any C# statement is terminated with a semicolon. And, at least, I don't use semicolons in any other place (e.g. in comments)...

Binary search tree traversal that compares two pointers for equality

I'm reading the Cormen algorithms book (binary search tree chapter) and it says that there are two ways to traverse the tree without recursion:
using a stack, and
a more complicated but elegant solution that uses no stack but assumes that two pointers can be tested for equality.
I've implemented the first option (using stack), but don't know how to implement the latter.
This is not homework, just reading to educate myself.
Any clues as to how to implement the second one in C#?
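Sure thing. You didn't say what kind of traversal you wanted, but here's a sketch of an in-order traversal. It assumes every node has a Parent reference (null for the root) and remembers the previously visited node in prev; comparing prev with t.Parent and t.Left is the pointer-equality test that replaces the stack.
Node prev = null;
Node t = tree.Root;
while (t != null) {
    Node next;
    if (prev == t.Parent) { // Arrived from the parent: go left first.
        if (t.Left != null) {
            next = t.Left;
        } else {
            Visit(t);
            next = t.Right ?? t.Parent;
        }
    } else if (prev == t.Left) { // Returned from the left subtree.
        Visit(t);
        next = t.Right ?? t.Parent;
    } else { // Returned from the right subtree.
        next = t.Parent;
    }
    prev = t;
    t = next;
}
To get pre-order, call Visit(t) when arriving from the parent instead; to get post-order, call it whenever the next step moves back up to the parent.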
Assuming that the nodes in the tree are references and the values are references, you can always call the static ReferenceEquals method on the Object class to compare to see if the references for any two nodes/values are the same.
