Querying very large xml files


I have a very large merged XML file, on the scale of gigabytes. I am using the following code with XPath queries to read and process the data.
IColumn column = output.Schema.FirstOrDefault(col => col.Type != typeof(string));
if (column != null)
{
throw new ArgumentException(string.Format("Column '{0}' must be of type 'string', not '{1}'", column.Name, column.Type.Name));
}
XmlReaderSettings settings = new XmlReaderSettings();
settings.ConformanceLevel = ConformanceLevel.Auto;//.Fragment;
XmlReader r = XmlReader.Create(input.BaseStream, settings);
XmlDocument xmlDocument = new XmlDocument();
xmlDocument.Load(r);
//xmlDocument.LoadXml("<root/>");
//xmlDocument.DocumentElement.CreateNavigator().AppendChild(r);
//xmlDocument.Load(input.BaseStream);
XmlNamespaceManager nsmgr = new XmlNamespaceManager(xmlDocument.NameTable);
if (this.namespaces != null)
{
foreach (Match nsdef in xmlns.Matches(this.namespaces))
{
string prefix = nsdef.Groups[1].Value;
string uri = nsdef.Groups[3].Value;
nsmgr.AddNamespace(prefix, uri);
}
}
foreach (XmlNode xmlNode in xmlDocument.DocumentElement.SelectNodes(this.rowPath, nsmgr))
{
foreach (IColumn col in output.Schema)
{
var explicitColumnMapping = this.columnPaths.FirstOrDefault(columnPath => columnPath.Value == col.Name);
XmlNode xml = xmlNode.SelectSingleNode(explicitColumnMapping.Key ?? col.Name, nsmgr);
output.Set(explicitColumnMapping.Value ?? col.Name, xml == null ? null : xml.InnerXml);
}
yield return output.AsReadOnly();
}
However, it only works well for smaller files, on the scale of megabytes. It works fine locally but fails on ADLA. I need to use the namespace manager as well. How can I scale it so that I can process bigger files? When I submit a job with a huge file, I always get this error, with no further information:
VertexFailedError

Copying answer I gave in MSDN Forum to same question:
U-SQL Extractors by default are scaled out to work in parallel over smaller parts of the input files, called extents. These extents are about 250MB in size each.
If the data you are processing cannot fit into an extent, you have to tell the extractor, with a C# attribute, that it has to see the file in its entirety. You do that by adding the following attribute ahead of your extractor class:
[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
Now in your case, XML documents obviously cannot be split since the parser needs to see the beginning and end of a document. This is especially true if you only have a single XML document (side note: Having GBs of a single XML document or JSON document is in my opinion often a bad idea).
Furthermore, I would suggest that you look at the sample XML extractor that we provide on our GitHub site here: https://github.com/Azure/usql/tree/master/Examples/DataFormats

Related

How can I find a specific XML element programmatically?

I have this chunk of XML
<EnvelopeStatus>
<CustomFields>
<CustomField>
<Name>Matter ID</Name>
<Show>True</Show>
<Required>True</Required>
<Value>3</Value>
</CustomField>
<CustomField>
<Name>AccountId</Name>
<Show>false</Show>
<Required>false</Required>
<Value>10804813</Value>
<CustomFieldType>Text</CustomFieldType>
</CustomField>
I have this code below:
// TODO find these programmatically rather than a strict path.
var accountId = envelopeStatus.SelectSingleNode("./a:CustomFields", mgr).ChildNodes[1].ChildNodes[3].InnerText;
var matterId = envelopeStatus.SelectSingleNode("./a:CustomFields", mgr).ChildNodes[0].ChildNodes[3].InnerText;
The problem is that sometimes the CustomField with 'Matter ID' might not be there. So I need a way to find the element based on what its 'Name' is, i.e. a programmatic way of finding it. I can't rely on the indexes being accurate.

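You can use this code to read the inner text from a specific element:
XmlDocument doc = new XmlDocument();
doc.Load("your.xml");
// Select every CustomField under EnvelopeStatus/CustomFields.
// This assumes no default namespace; otherwise pass an XmlNamespaceManager as in the question.
XmlNodeList nodes = doc.SelectNodes("/EnvelopeStatus/CustomFields/CustomField");
if (nodes != null && nodes.Count > 0)
{
    foreach (XmlNode field in nodes)
    {
        // Match on the Name child instead of relying on a child index
        XmlNode nameNode = field.SelectSingleNode("Name");
        if (nameNode != null && nameNode.InnerText == "Matter ID")
        {
            XmlNode valueNode = field.SelectSingleNode("Value");
            string text = valueNode?.InnerText;
        }
    }
}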

You can often find anything in an XML document by using the XPath capabilities that are available directly in the .NET Framework.
Maybe create a small XPath parser helper class:
public class EnvelopeStatusParser
{
public XmlNodeList GetNodesWithName(XmlDocument doc, string name)
{
return doc.SelectNodes($"//CustomField[Name[text()='{name}']]");
}
}
and then call it like below to get all CustomFields which have a Name that equals what you need to search for
// Creating the XML Document in some form - here reading from file
XmlDocument doc = new XmlDocument();
doc.Load(@"envelopestatus.xml");
var parser = new EnvelopeStatusParser();
var matchingNodes = parser.GetNodesWithName(doc, "Matter ID");
// Print how many nodes matched (printing the list itself only shows its type name)
Console.WriteLine(matchingNodes.Count);
matchingNodes = parser.GetNodesWithName(doc, "NotHere");
Console.WriteLine(matchingNodes.Count);
There are numerous XPath cheat sheets around, like this one from LaCoupa (xpath-cheatsheet), which can be quite helpful for fully utilizing XPath on XML structures.
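Since the question's code selects with an "a:" prefix and a namespace manager, the same lookup-by-name idea can be sketched with a namespace-aware XPath. This is only a sketch under assumptions: the namespace URI "urn:example" and the helper name GetValue are hypothetical, and the lookup name must not contain a single quote because it is embedded in the XPath literal.

```csharp
using System;
using System.Xml;

public static class CustomFieldLookup
{
    // Returns the Value of the CustomField whose Name equals 'name',
    // or null when no such CustomField exists.
    public static string GetValue(XmlNode envelopeStatus, XmlNamespaceManager mgr, string name)
    {
        // 'name' is placed inside an XPath string literal, so it must not contain a single quote.
        XmlNode value = envelopeStatus.SelectSingleNode(
            "./a:CustomFields/a:CustomField[a:Name='" + name + "']/a:Value", mgr);
        return value?.InnerText;
    }

    public static void Main()
    {
        var doc = new XmlDocument();
        // "urn:example" is a made-up namespace URI used only for this illustration.
        doc.LoadXml(
            "<EnvelopeStatus xmlns='urn:example'><CustomFields>" +
            "<CustomField><Name>AccountId</Name><Value>10804813</Value></CustomField>" +
            "</CustomFields></EnvelopeStatus>");

        var mgr = new XmlNamespaceManager(doc.NameTable);
        mgr.AddNamespace("a", "urn:example");

        Console.WriteLine(GetValue(doc.DocumentElement, mgr, "AccountId"));
        // 'Matter ID' is absent in this sample, so the helper returns null.
        Console.WriteLine(GetValue(doc.DocumentElement, mgr, "Matter ID") ?? "(missing)");
    }
}
```

Returning null for a missing field lets the caller decide how to handle an absent 'Matter ID' instead of failing on a child index.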


How do I read and parse an XML file in C#?

Use XmlDocument to read XML from a string or from a file.
using System.Xml;
XmlDocument doc = new XmlDocument();
doc.Load("c:\\temp.xml");
or
doc.LoadXml("<xml>something</xml>");
then find a node below it, for example like this
XmlNode node = doc.DocumentElement.SelectSingleNode("/book/title");
or
foreach(XmlNode node in doc.DocumentElement.ChildNodes){
string text = node.InnerText; //or loop through its children as well
}
then read the text inside that node like this
string text = node.InnerText;
or read an attribute
string attr = node.Attributes["theattributename"]?.InnerText;
Always check for null on Attributes["something"], since it will be null if the attribute does not exist.

LINQ to XML Example:
// Loading from a file, you can also load from a stream
var xml = XDocument.Load(@"C:\contacts.xml");
// Query the data and write out a subset of contacts
var query = from c in xml.Root.Descendants("contact")
where (int)c.Attribute("id") < 4
select c.Element("firstName").Value + " " +
c.Element("lastName").Value;
foreach (string name in query)
{
Console.WriteLine("Contact's Full Name: {0}", name);
}
Reference: LINQ to XML at MSDN

Here's an application I wrote for reading xml sitemaps:
using System;
using System.Collections.Generic;
using System.Windows.Forms;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.IO;
using System.Data;
using System.Xml;
namespace SiteMapReader
{
class Program
{
static void Main(string[] args)
{
Console.WriteLine("Please Enter the Location of the file");
// get the location we want to get the sitemaps from
string dirLoc = Console.ReadLine();
// get all the sitemaps
string[] sitemaps = Directory.GetFiles(dirLoc);
StreamWriter sw = new StreamWriter(Application.StartupPath + @"\locs.txt", true);
// loop through each file
foreach (string sitemap in sitemaps)
{
try
{
// new xdoc instance
XmlDocument xDoc = new XmlDocument();
//load up the xml from the location
xDoc.Load(sitemap);
// cycle through each child node
foreach (XmlNode node in xDoc.DocumentElement.ChildNodes)
{
// first node is the url ... have to go to the nested loc node
foreach (XmlNode locNode in node)
{
// there are a couple of child nodes here, so only take data from the node named loc
if (locNode.Name == "loc")
{
// get the content of the loc node
string loc = locNode.InnerText;
// write it to the console so you can see its working
Console.WriteLine(loc + Environment.NewLine);
// write it to the file
sw.Write(loc + Environment.NewLine);
}
}
}
}
catch { }
}
// close the writer so buffered output is flushed to the file
sw.Close();
Console.WriteLine("All Done :-)");
Console.ReadLine();
}
static void readSitemap()
{
}
}
}
Code on Paste Bin
http://pastebin.com/yK7cSNeY

There are lots of ways, some of them:
XmlSerializer: use a class with the target schema you want to read, then use XmlSerializer to load the XML data into an instance of that class.
LINQ to XML
XmlTextReader
XmlDocument
XPathDocument (read-only access)
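As an illustration of the first option, here is a minimal XmlSerializer sketch. The Book class, its element names, and the sample XML are assumptions made up for this example; a real class would mirror your own schema.

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

// Hypothetical target class: its shape mirrors the sample XML used below.
[XmlRoot("book")]
public class Book
{
    [XmlElement("title")]
    public string Title { get; set; }

    [XmlElement("author")]
    public string Author { get; set; }
}

public static class SerializerDemo
{
    // Deserializes an XML string into a Book instance.
    public static Book Parse(string xml)
    {
        var serializer = new XmlSerializer(typeof(Book));
        using (var reader = new StringReader(xml))
        {
            return (Book)serializer.Deserialize(reader);
        }
    }

    public static void Main()
    {
        Book book = Parse("<book><title>Dune</title><author>Frank Herbert</author></book>");
        Console.WriteLine(book.Title + " by " + book.Author);
    }
}
```

To read from a file instead of a string, a StreamReader can be passed to Deserialize in place of the StringReader.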

You could use a DataSet to read XML strings.
var xmlString = File.ReadAllText(FILE_PATH);
var stringReader = new StringReader(xmlString);
var dsSet = new DataSet();
dsSet.ReadXml(stringReader);
Posting this for the sake of information.

You can either:
Use XmlSerializer class
Use XmlDocument class
Examples are on the msdn pages provided

Linq to XML.
Also, VB.NET has much better xml parsing support via the compiler than C#. If you have the option and the desire, check it out.

Check out XmlTextReader class for instance.

There are different ways, depending on what you want to achieve.
XmlDocument is lighter than XDocument, but if you only wish to verify minimally that a string contains XML, then a regular expression is possibly the fastest and lightest choice you can make. For example, I have implemented smoke tests with SpecFlow for my API, and when I want to test whether one of the results is valid XML, I use a regular expression. But if I need to extract values from this XML, then I parse it with XDocument, which is faster to write and takes less code.
I would use XmlDocument if I have to work with a big XML file (sometimes I work with XML files of around 1M lines, or even more); then I could even read it line by line. Why? Try opening more than 800MB in private bytes in Visual Studio; even in production you should not have objects bigger than 2GB. You can with a tweak, but you should not. If you have to parse a document that contains A LOT of lines, then that document should probably be CSV.
I have written this comment because I see a lot of examples with XDocument. XDocument is not good for big documents, or for when you only want to verify that the content is valid XML. If you wish to check whether the XML itself makes sense, then you need a schema.
I also downvoted the suggested answer, because I believe it needs the above information inside it. Imagine I need to verify whether 200M of XML, 10 times an hour, is valid XML. XDocument will waste a lot of resources.
prasanna venkatesh also states that you could try loading the string into a DataSet; that will indicate valid XML as well.
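For documents too large to hold in memory, one common alternative (not the approach described in this answer) is to stream with XmlReader and materialize only one element at a time through XNode.ReadFrom. The sketch below assumes a hypothetical helper named StreamElements and a tiny inline sample in place of a real file.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class XmlStreaming
{
    // Lazily yields each element with the given local name, one at a time,
    // so the whole document is never loaded into memory at once.
    public static IEnumerable<XElement> StreamElements(TextReader input, string localName)
    {
        using (XmlReader reader = XmlReader.Create(input))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.LocalName == localName)
                {
                    // ReadFrom consumes the element and leaves the reader on the
                    // node that follows it, so Read() must not be called here.
                    yield return (XElement)XNode.ReadFrom(reader);
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }

    public static void Main()
    {
        // Inline sample; for a real file, pass File.OpenText(path) instead.
        var sample = new StringReader(
            "<contacts><contact id='1'/><other/><contact id='2'/></contacts>");
        foreach (XElement contact in StreamElements(sample, "contact"))
        {
            Console.WriteLine((int)contact.Attribute("id"));
        }
    }
}
```

Each yielded XElement can still be queried with LINQ to XML, which keeps the convenience of XDocument while memory use stays bounded by the size of a single element.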

public void ReadXmlFile()
{
string path = HttpContext.Current.Server.MapPath("~/App_Data"); // Finds the location of App_Data on server.
XmlTextReader reader = new XmlTextReader(System.IO.Path.Combine(path, "XMLFile7.xml")); //Combines the location of App_Data and the file name
while (reader.Read())
{
switch (reader.NodeType)
{
case XmlNodeType.Element:
break;
case XmlNodeType.Text:
columnNames.Add(reader.Value);
break;
case XmlNodeType.EndElement:
break;
}
}
}
You can avoid the first statement and just specify the path name in constructor of XmlTextReader.

If you want to retrieve a particular value from an XML file:
XmlDocument _LocalInfo_Xml = new XmlDocument();
_LocalInfo_Xml.Load(fileName);
XmlElement _XmlElement;
_XmlElement = _LocalInfo_Xml.GetElementsByTagName("UserId")[0] as XmlElement;
string Value = _XmlElement.InnerText;

Here is another approach using Cinchoo ETL - an open source library to parse xml file with few lines of code.
using (var r = ChoXmlReader<Item>.LoadText(xml)
.WithXPath("//item")
)
{
foreach (var rec in r)
rec.Print();
}
public class Item
{
public string Name { get; set; }
public string ProtectionLevel { get; set; }
public string Description { get; set; }
}
Sample fiddle: https://dotnetfiddle.net/otYq5j
Disclaimer: I'm author of this library.

How to use LINQ to XML when XML parent and child nodes have the same name

I am trying to extract some SQL data to XML from a Microsoft Dynamics environment, I am currently using LINQ To XML in C# to read and write to my XML files. One piece of data I need is from a view called SECURITYSUBROLE. Looking at the structure of this view shows that there is a column also named SECURITYSUBROLE. My normal method of extraction has given me this XML.
<SECURITYSUBROLE>
<SECURITYROLE>886301</SECURITYROLE>
<SECURITYSUBROLE>886317</SECURITYSUBROLE>
<VALIDFROM>1900-01-01T00:00:00-06:00</VALIDFROM>
<VALIDFROMTZID>0</VALIDFROMTZID>
<VALIDTO>1900-01-01T00:00:00-06:00</VALIDTO>
<VALIDTOTZID>0</VALIDTOTZID>
<RECVERSION>1</RECVERSION>
<RECID>886317</RECID>
</SECURITYSUBROLE>
When I try to import this data later on, I am getting errors because the parent XML node has the same name as a child node. Here is a snippet of the import method:
XmlReaderSettings settings = new XmlReaderSettings();
settings.CheckCharacters = false;
XmlReader reader = XmlReader.Create(path, settings);
reader.MoveToContent();
int count = 1;
List<XElement> xmlSubset = new List<XElement>();
while (reader.ReadToFollowing(xmlTag))
{
if (count % 1000 == 0)
{
xmlSubset.Add(XElement.Load(reader.ReadSubtree()));
XDocument xmlTemp = new XDocument(new XElement(xmlTag));
foreach (XElement elem in xmlSubset)
{
xmlTemp.Root.Add(elem);
}
xmlSubset = new List<XElement>();
ImportTableByName(connectionString, tableName, xmlTemp);
count = 1;
}
else
{
xmlSubset.Add(XElement.Load(reader.ReadSubtree()));
count++;
}
}
}
It's currently failing on the XmlReader.ReadToFollowing, where it doesn't know where to go next because of the name confusion. So my question has two parts:
1) Is there some better way to be extracting this data other than to XML?
2) Is there a way through LINQ To XML that I can somehow differentiate between the parent and child nodes named exactly the same?

To get the elements (in your case) for SECURITYSUBROLE, you can check whether the elements have children:
XElement root = XElement.Load(path);
var subroles = root.Descendants("SECURITYSUBROLE") // all subroles
.Where(x => !x.HasElements); // only subroles without children

I'm going to suggest a different approach:
1) VS2013 (possibly earlier versions too) has a function to create a class from an XML source. So get one of your XML files and copy the content to your clipboard. Then in a new class file Edit --> Paste Special --> Paste XML as Classes
2) Look into XmlSerializer, which will allow you to convert an XML file into an in-memory object of a strongly typed class.
XmlSerializer s = new XmlSerializer(typeof(YourNewClassType));
TextReader r = new StreamReader(XmlFileLocation);
var dataFromYourXmlAsAStronglyTypedClass = (YourNewClassType) s.Deserialize(r);
r.Close();

Reading a single node from XML file and using it as a condition

I am simply trying to read a particular node from an XML and use it as a string variable in a condition. This gets me to the XML file and gives me the whole thing.
string url = @"http://agent.mtconnect.org/current";
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.Load(url);
richTextBox1.Text = xmlDoc.InnerXml;
But I need the power state, "ON" or "OFF" (XML section below; the whole XML can be viewed online):
<Events><PowerState dataItemId="p2" timestamp="2013-03-11T12:27:30.275747" name="power" sequence="4042868976">ON</PowerState></Events>
I have tried everything I know of. I am just not that familiar with XML files, and the other posts get me nowhere.
HELP PLEASE!

You may try LINQ2XML for that:
string value = (string) (XElement.Load("http://agent.mtconnect.org/current")
.Descendants().FirstOrDefault(d => d.Name.LocalName == "PowerState"));

If you wanted to avoid LINQ, or if it is not working for you you can use straight XML traversal for this:
string url = @"http://agent.mtconnect.org/current";
System.Xml.XmlDocument xmlDoc = new System.Xml.XmlDocument();
xmlDoc.Load(url);
System.Xml.XmlNamespaceManager theNameManager = new System.Xml.XmlNamespaceManager(xmlDoc.NameTable);
theNameManager.AddNamespace("mtS", "urn:mtconnect.org:MTConnectStreams:1.2");
theNameManager.AddNamespace("m", "urn:mtconnect.org:MTConnectStreams:1.2");
theNameManager.AddNamespace("xsi", "http://www.w3.org/2001/XMLSchema-instance");
System.Xml.XmlElement DeviceStreams = (System.Xml.XmlElement)xmlDoc.SelectSingleNode("descendant::mtS:DeviceStream", theNameManager);
System.Xml.XmlNodeList theStreams = DeviceStreams.SelectNodes("descendant::mtS:ComponentStream", theNameManager);
foreach (System.Xml.XmlNode CompStream in theStreams)
{
if (CompStream.Attributes["component"].Value == "Electric")
{
System.Xml.XmlElement EventElement = (System.Xml.XmlElement)CompStream.SelectSingleNode("descendant::mtS:Events", theNameManager);
System.Xml.XmlElement PowerElement = (System.Xml.XmlElement)EventElement.SelectSingleNode("descendant::mtS:PowerState", theNameManager);
Console.Out.WriteLine(PowerElement.InnerText);
Console.In.Read();
}
}
When traversing any document with a default namespace in the root node, I have found it is imperative to have a namespace manager. Without it the document is just un-navigable.
I created this code in a console application. It worked for me. Also I am no guru and I may be making some mistakes here. I am not sure if there is some way to have the default namespace referenced without naming it (mtS). Anyone who knows how to make this cleaner or more efficient please comment.
EDIT:
For one less level of 'clunk' you can change it to this:
if (CompStream.Attributes["component"].Value == "Electric")
{
Console.Out.WriteLine(((System.Xml.XmlElement)CompStream.SelectSingleNode("descendant::mtS:Events", theNameManager)).InnerText);
Console.In.Read();
}
because there is only one element in there and its innerText is all you will get.
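On the open question of referencing the default namespace without inventing a prefix: with LINQ to XML (a different API from the XmlDocument code above) the namespace can be read off the root element through GetDefaultNamespace. The following is a sketch only; the inline XML is a trimmed stand-in for the real MTConnect response and ReadPowerState is a hypothetical helper name.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class PowerStateDemo
{
    // Returns the text of the first PowerState element, or null if there is none.
    public static string ReadPowerState(XElement root)
    {
        // Take the default namespace from the document itself instead of hard-coding a prefix.
        XNamespace ns = root.GetDefaultNamespace();
        return (string)root.Descendants(ns + "PowerState").FirstOrDefault();
    }

    public static void Main()
    {
        // Trimmed stand-in for the MTConnect document; the real one has many more elements.
        XElement root = XElement.Parse(
            "<MTConnectStreams xmlns='urn:mtconnect.org:MTConnectStreams:1.2'>" +
            "<Events><PowerState dataItemId='p2' name='power'>ON</PowerState></Events>" +
            "</MTConnectStreams>");

        string state = ReadPowerState(root);
        Console.WriteLine(state == "ON" ? "Machine is on" : "Machine is off or unknown");
    }
}
```

The explicit cast to string returns null when no PowerState element is found, so the result can be used directly in a condition without a separate null check on the element.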

Sample code for converting TFS work item to XML?

I want to write a simple program to query TFS and convert all of the work items to a uniform type XML file and save them to separate files in a folder.
I'm sure this kind of work is commonly enough done, and is very simple - but I can find no samples on the Internet and no way of connecting to TFS programmatically and retrieving only work item info. Would anyone be able to help me out ?
Thanks very much

private TfsTeamProjectCollection GetTfsTeamProjectCollection()
{
TeamProjectPicker workitemPicker = new TeamProjectPicker(TeamProjectPickerMode.SingleProject, false, new UICredentialsProvider());
workitemPicker.AcceptButtonText = "workitemPicker.AcceptButtonText";
workitemPicker.Text = "workitemPicker.Text";
workitemPicker.ShowDialog();
if (workitemPicker.SelectedProjects != null && workitemPicker.SelectedProjects.Length > 0)
{
return workitemPicker.SelectedTeamProjectCollection;
}
return null;
}
private WorkItemCollection WorkItemByQuery(TfsTeamProjectCollection projects, string query) // query looks like this: SELECT [System.ID], [System.Title] FROM WorkItems WHERE [System.Title] CONTAINS 'Lei Yang'
{
WorkItemStore wis = new WorkItemStore(projects);
return wis.Query (query );
}
WorkItemCollection is what you want. You can get WorkItems and their properties.

You should use the TFS SDK.
You can find many tutorials on internet like this one.
MSDN will also help you.

You can use the TFS SDK working with work items. Then all you need to do is convert it into the format you want.

You can fetch the query results as suggested by Lei Yang.
Then start building the XML.
XmlDocument xmlDoc = new XmlDocument();
//XML declaration
XmlDeclaration xmlDeclaration = xmlDoc.CreateXmlDeclaration("1.0", "utf-8", null);
// Create the root element
XmlElement rootNode = xmlDoc.CreateElement("WorkItemFieldList");
xmlDoc.InsertBefore(xmlDeclaration, xmlDoc.DocumentElement);
xmlDoc.AppendChild(rootNode);
//Create a new element and add it to the root node
XmlElement parentnode = xmlDoc.CreateElement("UserInput");
// Iterate through all the work item field values
xmlDoc.DocumentElement.PrependChild(parentnode);
//wiTrees of type WorkItemLinkInfo[] is the result of RunLinkQuery
foreach (var item in wiTrees)
{
int fieldcount = workItemStore.GetWorkItem(item.TargetId).Fields.Count;
while (fieldcount > 0)
{
//Create the required nodes
XmlElement mainNode = xmlDoc.CreateElement(workItemStore.GetWorkItem(item.TargetId).Fields[fieldcount -1].Name.ToString().Replace(" ", "-"));
// retrieve the text
//Use the custom method NullSafeToString to handle null values and convert them to String.Empty
XmlText categoryText = xmlDoc.CreateTextNode(workItemStore.GetWorkItem(item.TargetId).Fields[fieldcount - 1].Value.NullSafeToString().ToString());
// append the nodes to the parentNode without the value
parentnode.AppendChild(mainNode);
// save the value of the fields into the nodes
mainNode.AppendChild(categoryText);
fieldcount--;
}
}
// Save to the XML file
xmlDoc.Save("widetails.xml");
