Reading XML File ( File size > 500 MB)

I'm trying to parse a large XML file (about 600 MB) using XmlTextReader.
It takes a long time, and eventually the entire process is aborted with an exception.
Message: "Thread is aborted"
Method:
private string ReadXml(XmlTextReader reader, string fileName)
{
string finalXML = "";
string s1 = "";
try
{
while (reader.Read())
{
switch (reader.NodeType)
{
case XmlNodeType.Element: // The node is an element.
s1 += "<" + reader.Name + ">";
break;
case XmlNodeType.Text: //Display the text in each element.
s1 += reader.Value;
break;
case XmlNodeType.EndElement: //Display the end of the element.
s1 += "</" + reader.Name + ">";
break;
}
finalXML = s1;
}
}
catch (Exception ex)
{
Logger.Logger.LogMessage(ex, "File Processing error: " + fileName);
}
reader.Close();
reader.Dispose();
return finalXML;
}
And then reading and deserializing:
string finalXML = string.Empty;
XmlTextReader reader = new XmlTextReader(unzipfile);
finalXML = await ReadXml(reader, fileName);
var xmlremovenamespae = Helper.RemoveAllNamespaces(finalXML);
XmlParseObjectNew.BizData myxml = new XmlParseObjectNew.BizData();
using (StringReader sr = new StringReader(xmlremovenamespae))
{
XmlSerializer serializer = new XmlSerializer(typeof(XmlParseObjectNew.BizData));
myxml = (XmlParseObjectNew.BizData)serializer.Deserialize(sr);
}
Is there a better way to read and parse a large XML file? Any suggestions are welcome.

I tried this and it works fine. It parses an XML file larger than 500 MB within a few seconds.
fileName = "your file path";
using (TextReader textReader = new StreamReader(fileName))
{
using (XmlTextReader reader = new XmlTextReader(textReader))
{
reader.Namespaces = false;
// Replace YourXmlClassType with the class that maps to your XML root element
XmlSerializer serializer = new XmlSerializer(typeof(YourXmlClassType));
parseData = (YourXmlClassType)serializer.Deserialize(reader);
}
}

The problem, as mentioned by Jon Skeet and DiskJunky, is that your dataset is simply too large to load into memory and your code is not optimized for handling it. That is why various classes are throwing an 'out of memory' exception at you.
First of all, string concatenation. Using simple concatenation (a + b) with multiple strings is usually a bad idea due to the way strings work. I would recommend looking up online how to handle string concatenation effectively (for example, Jon Skeet's Concatenating Strings Efficiently).
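As a rough illustration of the concatenation point (a sketch, not taken from the linked article; the class and method names are made up for this example): every += on a string allocates a new string and copies everything built so far, while a StringBuilder appends into a growing buffer.

```csharp
using System.Text;

public static class ConcatExample
{
    // Repeated += copies the whole string on every iteration,
    // so the total work grows quadratically with the number of pieces.
    public static string WithConcatenation(int count)
    {
        string result = "";
        for (int i = 0; i < count; i++)
        {
            result += "<item>";
        }
        return result;
    }

    // StringBuilder appends into an internal buffer and only
    // builds the final string once, in ToString().
    public static string WithStringBuilder(int count)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < count; i++)
        {
            sb.Append("<item>");
        }
        return sb.ToString();
    }
}
```

Both methods return the same text; the difference only becomes noticeable once the loop runs many thousands of times, as it does for a file of several hundred MB.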
However, that only optimizes your code. The main issue is the sheer size of the XML file you are trying to load into memory. To handle large datasets, it is usually better to 'stream' the data, processing it in chunks instead of loading the entire file.
As you have not shown an example of your XML, I took the liberty of making a simple example to illustrate what I mean.
Consider you have the following XML:
<root>
<specialelement>
<value1>somevalue</value1>
<value2>somevalue</value2>
</specialelement>
<specialelement>
<value1>someothervalue</value1>
<value2>someothervalue</value2>
</specialelement>
...
</root>
Of this XML you want to parse the specialelement into an object, with the following class definition:
[XmlRoot("specialelement")]
public class ExampleClass
{
[XmlElement(ElementName = "value1")]
public string Value1 { get; set; }
[XmlElement(ElementName = "value2")]
public string Value2 { get; set; }
}
I'll assume we can process each SpecialElement individually, and define a handler for this as follows:
public void HandleElement(ExampleClass item)
{
// Process stuff
}
Now we can use the XmlTextReader to read each node in the XML individually. When we reach our specialelement, we keep track of the data contained within it. When we reach the end of our specialelement, we deserialize it into an object and send it to our handler for processing. For example:
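// Serializer for a single specialelement; create it once and reuse it
var serializer = new XmlSerializer(typeof(ExampleClass));
using (var reader = new XmlTextReader( /* your inputstream */ ))
{
// Buffer for the element contents
StringBuilder sb = new StringBuilder(1000);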
// Read till next node
while (reader.Read())
{
switch (reader.NodeType)
{
case XmlNodeType.Element:
// Clear the stringbuilder when we start with our element
if (string.Equals(reader.Name, "specialelement"))
{
sb.Clear();
}
// Append current element without namespace
sb.Append("<").Append(reader.Name).Append(">");
break;
case XmlNodeType.Text: // Append the text content of the current element
sb.Append(reader.Value);
break;
case XmlNodeType.EndElement:
// Append the closure element
sb.Append("</").Append(reader.Name).Append(">");
// Check if we have finished reading our element
if (string.Equals(reader.Name, "specialelement"))
{
// The stringbuilder now contains the entire 'SpecialElement' part
using (TextReader textReader = new StringReader(sb.ToString()))
{
// Deserialize
var deserializedElement = (ExampleClass)serializer.Deserialize(textReader);
// Send to handler
HandleElement(deserializedElement);
}
}
break;
}
}
}
Because we process the data as it comes in from the stream, we do not have to load the entire file into memory. This keeps the memory usage of the program low and prevents out-of-memory exceptions.
Check out this fiddle to see it in action.
Note that this is a quick example; there are still plenty of places where you can improve and optimize this code further.
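A variation on the same idea, as a sketch rather than a drop-in replacement: instead of rebuilding each element's markup in a StringBuilder, you can hand the reader's subtree straight to the XmlSerializer. This assumes the elements carry no XML namespaces (the question strips them), and StreamingExample and ReadItems are hypothetical names. ExampleClass is repeated only so the snippet is self-contained.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Serialization;

[XmlRoot("specialelement")]
public class ExampleClass
{
    [XmlElement(ElementName = "value1")]
    public string Value1 { get; set; }
    [XmlElement(ElementName = "value2")]
    public string Value2 { get; set; }
}

public static class StreamingExample
{
    // Yields one deserialized object per <specialelement>,
    // holding only the current element in memory.
    public static IEnumerable<ExampleClass> ReadItems(TextReader input)
    {
        var serializer = new XmlSerializer(typeof(ExampleClass));
        using (var reader = XmlReader.Create(input))
        {
            // Advance to the next <specialelement> start tag, if any
            while (reader.ReadToFollowing("specialelement"))
            {
                // ReadSubtree exposes just this element and its children
                using (var subtree = reader.ReadSubtree())
                {
                    yield return (ExampleClass)serializer.Deserialize(subtree);
                }
            }
        }
    }
}
```

For a file on disk you could pass new StreamReader(path) and call HandleElement on each item inside a foreach loop.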

Related

How to parse a xml document without a root node?

I have an xml document which has no root node. It looks like this:
<?xml version="1.0"?>
<Line>
<City>Paris</City>
<Country>France</Country>
</Line>
<Line>
<City>Lissabon</City>
<Country>Spain</Country>
</Line>
Now I want to read it Line by Line and write the contents to a database. However, XmlDocument seems to insist that a root node must exist. How can I process this file?

If you want to parse it as an XML document, you can add a root node like Denis proposed in his comment.
If you would just like to read each line and write it to a database, you can handle the file like an ordinary (text) file and read its contents line by line using a StreamReader.
This would look something like this:
string line;
// Read the file and process it line by line.
// The using block closes the reader even if an exception is thrown.
using (var reader = new StreamReader(FILEPATH))
{
    while ((line = reader.ReadLine()) != null)
    {
        // Depending on what you need, you could strip the XML tags
        // and write the line to the database
    }
}

You could try something like this (simple WinForms app with a button and a rich text box to display output for testing):
using System;
using System.Text;
using System.Xml;
using System.Windows.Forms;
namespace WindowsFormsApp11
{
public partial class Form1 : Form
{
public Form1()
{
InitializeComponent();
}
private void button1_Click(object sender, EventArgs e)
{
StringBuilder sb = new StringBuilder();
XmlReaderSettings settings = new XmlReaderSettings
{
ConformanceLevel = ConformanceLevel.Fragment
};
using (XmlReader reader = XmlReader.Create(@"c:\ab\countries.xml", settings))
{
while(reader.Read())
{
if (reader.Name != "Line") // Ignore the <Line> nodes
{
switch (reader.NodeType)
{
case XmlNodeType.Element:
sb.Append(string.Format("{0}:", reader.Name));
break;
case XmlNodeType.Text:
sb.Append(string.Format(" {0}{1}", reader.Value, Environment.NewLine));
break;
}
}
}
}
richTextBox1.Text = sb.ToString();
}
}
}

Maybe not the best solution, but you could read the lines of your XML file into a List (or array) and insert the missing root node:
// Read lines into List
var list = File.ReadLines("doc.xml").ToList();
// Insert missing nodes
list.Insert(1, "<root>"); // Use 1, because 0 is the XML declaration
list.Insert(list.Count, "</root>"); // Add closing tag to the end
// Join the lines into the final XML string
var xml_str = string.Concat(list);
// Having a string, we can create, for instance, XElement (or XDocument)
var xml = XElement.Parse(xml_str);
Console.WriteLine(xml.Element("Line").Element("City").Value);
//Output: Paris
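Another option, sketched here under the assumption that every record is a top-level <Line> element: combine ConformanceLevel.Fragment with XNode.ReadFrom, so each Line is materialized as a small XElement on its own and no root node has to be added. LineReader and ReadLines are names invented for this sketch.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class LineReader
{
    // Yields each top-level <Line> element as an XElement,
    // one at a time, without requiring a root node.
    public static IEnumerable<XElement> ReadLines(TextReader input)
    {
        var settings = new XmlReaderSettings
        {
            ConformanceLevel = ConformanceLevel.Fragment
        };
        using (var reader = XmlReader.Create(input, settings))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "Line")
                {
                    // ReadFrom consumes the whole element and leaves the
                    // reader on the node that follows it
                    yield return (XElement)XNode.ReadFrom(reader);
                }
                else
                {
                    // Skip whitespace and anything else between records
                    reader.Read();
                }
            }
        }
    }
}
```

Inside a foreach over ReadLines you can read (string)line.Element("City") and (string)line.Element("Country") and write them to the database.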

Merging huge (2GB) XMLs in memory (without any memory exceptions)

I would like C# code that optimally appends two XML strings. Both have the same schema. I tried StreamReader/StreamWriter, File.WriteAllText, and FileStream.
The problem I see is that it uses more than 98% of physical memory, which results in an out-of-memory exception.
Is there a way to merge them without getting any memory exceptions? Time is not a concern for me.
If holding the data in memory is going to be a problem, what would be better? Saving it to the file system?
Further details:
Here is my simple program, to provide more detail:
static void Main(string[] args)
{
Program p = new Program();
XmlDocument x1 = new XmlDocument();
XmlDocument x2 = new XmlDocument();
x1.Load("C:\\XMLFiles\\1.xml");
x2.Load("C:\\XMLFiles\\2.xml");
List<string> files = new List<string>();
files.Add("C:\\XMLFiles\\1.xml");
files.Add("C:\\XMLFiles\\2.xml");
p.ConsolidateFiles(files, "C:\\XMLFiles\\Result.xml");
p.MergeFiles("C:\\XMLFiles\\Result.xml", x1.OuterXml, x2.OuterXml, "<Data>", "</Data>");
Console.ReadLine();
}
public void ConsolidateFiles(List<String> files, string outputFile)
{
var output = new StreamWriter(File.Open(outputFile, FileMode.Create));
output.WriteLine("<Data>");
foreach (var file in files)
{
var input = new StreamReader(File.Open(file, FileMode.Open));
string line;
while (!input.EndOfStream)
{
line = input.ReadLine();
if (!line.Contains("<Data>") &&
!line.Contains("</Data>"))
{
output.Write(line);
}
}
}
output.WriteLine("</Data>");
}
public void MergeFiles(string outputPath, string xmlState, string xmlFederal, string prefix, string suffix)
{
File.WriteAllText(outputPath, prefix);
File.AppendAllText(outputPath, xmlState);
File.AppendAllText(outputPath, xmlFederal);
File.AppendAllText(outputPath, suffix);
}
XML Sample:
<Data> </Data> is appended at the beginning & End
XML 1: <Sections> <Section></Section> </Sections>
XML 2: <Sections> <Section></Section> </Sections>
Merged: <Data> <Sections> <Section></Section> </Sections> <Sections> <Section></Section> </Sections> </Data>

Try this: a stream-based approach that avoids loading all the XML into memory at once.
static void Main(string[] args)
{
List<string> files = new List<string>();
files.Add("C:\\XMLFiles\\1.xml");
files.Add("C:\\XMLFiles\\2.xml");
ConsolidateFiles(files, "C:\\XMLFiles\\Result.xml");
Console.ReadLine();
}
private static void ConsolidateFiles(List<String> files, string outputFile)
{
using (var output = new StreamWriter(outputFile))
{
output.WriteLine("<Data>");
foreach (var file in files)
{
using (var input = new StreamReader(file))
{
while (!input.EndOfStream)
{
string line = input.ReadLine();
if (!line.Contains("<Data>") &&
!line.Contains("</Data>"))
{
output.Write(line);
}
}
}
}
output.WriteLine("</Data>");
}
}
An even better approach is to use XmlReader (http://msdn.microsoft.com/en-us/library/system.xml.xmlreader(v=vs.90).aspx). This will give you a stream reader designed specifically for xml, rather than StreamReader which is for reading general text.

Take a look here
The answer given by Teoman Soygul seems to be what you're looking for.

This is untested, but I would do something along these lines using XmlTextReader and XmlTextWriter. You do not want to read all of the XML text into memory or store it in a string, and you do not want to use XElement/XDocument/etc. anywhere in the middle.
// Requires: using System.Text; using System.Xml;
using (var writer = new XmlTextWriter("ResultFile.xml", Encoding.UTF8))
{
    writer.WriteStartDocument();
    writer.WriteStartElement("Data");
    foreach (var path in new[] { "XmlFile1.xml", "XmlFile2.xml" })
    {
        using (var reader = new XmlTextReader(path))
        {
            // Skip the XML declaration (if any) and move to the root element
            reader.MoveToContent();
            // Copy the root element and everything inside it, node by node
            writer.WriteNode(reader, true);
        }
    }
    writer.WriteEndElement();
}
Again no guarantees that this exact code will work as-is (or that it even compiles), but I think that is the idea you're looking for. Stream data from File1 first and write it directly out to the result file. Then, stream data from File2 and write it out. At no point should a full XML file be in memory.

If you run on 64bit, try this: go to your project properties -> build tab -> Platform target: change "Any CPU" to "x64".
This solved my problem for loading huge XML files in memory.

You have to go to the file system, unless you have lots of RAM.
One simple approach:
File.WriteAllText("output.xml", "<Data>");
File.AppendAllText("output.xml", File.ReadAllText("xml1.xml"));
File.AppendAllText("output.xml", File.ReadAllText("xml2.xml"));
File.AppendAllText("output.xml", "</Data>");
another:
var fNames = new[] { "xml1.xml", "xml2.xml" };
string line;
using (var writer = new StreamWriter("output.xml"))
{
writer.WriteLine("<Data>");
foreach (var fName in fNames)
{
using (var file = new System.IO.StreamReader(fName))
{
while ((line = file.ReadLine()) != null)
{
writer.WriteLine(line);
}
}
}
writer.WriteLine("</Data>");
}
All of this assumes that xml1.xml and xml2.xml contain no schema or declaration tags of their own.
If they do, just add code to omit them.

What is the most efficient way to take XML from API and store it locally?

I am trying to find the fastest way to read XML from the Merriam-Webster dictionary and store it in a local file for later use. Below, I try to implement a module which does a few things:
Read 2000 words from a local directory
Look up each of the words in the merriam dictionary using the API
Store the definition(s) in a local XML for later use.
I'm not sure whether XML is the best way to store this data, but it seemed like the simplest thing to do. At first, I thought I would do it in separate steps (1. look up each word and store the word and its definitions in a data structure; 2. dump all the data into XML). However, this poses a problem, because it is just too much to store on the runtime (call) stack.
So, in this scenario, I try to speed things up by looking up each word and then saving it to the XML file one by one. This, however, is also slow: it takes around 10 minutes per 500-600 words.
public void load_module() // stores words/definitions into xml file
{ // 1. Pick up word from text file 2. Look up word's definition 3. Store in Xml
string workdirect = Directory.GetCurrentDirectory();
workdirect = workdirect.Substring(0, workdirect.LastIndexOf("bin"));
workdirect += "words1.txt";
using (StreamReader read = new StreamReader(workdirect)) // 1. Pick up word from text file
{
while (!read.EndOfStream)
{
string line = read.ReadLine();
var definitions = load(line.ToLower()); // 2. Retrieve Words Definitions
store_xml(line, definitions);
wordlist.Add(line);
}
}
}
public List<string> load(string word)
{
XmlDocument doc = new XmlDocument();
List<string> definitions = new List<string>();
XmlNodeList node = null;
doc.Load("http://www.dictionaryapi.com/api/v1/references/collegiate/xml/"+word+"?key=*****************"); // Asterisks to hide the actual API key
if (doc.SelectSingleNode("entry_list").SelectSingleNode("entry").SelectSingleNode("def") == null)
{
return definitions;
}
node = doc.SelectSingleNode("entry_list").SelectSingleNode("entry").SelectSingleNode("def").SelectNodes("dt");
// TO DO : implement definitions if there is no node "def" in first node entry "entry_list"
foreach (XmlNode item in node)
{
definitions.Add(item.InnerXml.ToString().ToLower());
}
return definitions;
}
public void store_xml(string word, List<string> definitions)
{
string local = Directory.GetCurrentDirectory();
string name = "dictionary_word.xml";
local = local.Substring(0, local.LastIndexOf("bin"));
bool exists = File.Exists(local + name);
if (exists)
{
XmlDocument doc = new XmlDocument();
doc.Load(local + name);
XmlElement wordindoc = doc.CreateElement("Word");
wordindoc.SetAttribute("xmlns", word);
XmlElement defs = doc.CreateElement("Definitions");
foreach (var item in definitions)
{
XmlElement def = doc.CreateElement("Definition");
def.InnerText = item;
defs.AppendChild(def);
}
wordindoc.AppendChild(defs);
doc.DocumentElement.AppendChild(wordindoc);
doc.Save(local+name);
}
else
{
using (XmlWriter writer = XmlWriter.Create(local + name))
{
writer.WriteStartDocument();
writer.WriteStartElement("Dictionary");
writer.WriteStartElement("Word", word);
writer.WriteStartElement("Definitions");
foreach (var def in definitions)
{
writer.WriteElementString("Definition", def);
}
writer.WriteEndElement();
writer.WriteEndElement();
writer.WriteEndElement();
writer.WriteEndDocument();
}
}
}
}

When handling large amounts of data that need to be exported to XML, I would normally keep the data in memory as a collection of custom objects rather than as an XmlDocument:
// Named WordDefinition because a C# member cannot share the name of its enclosing type
public class WordDefinition
{
public string Word { get; set; }
public string Definition { get; set; }
}
I would then use XmlWriter to write the collection to the XML file:
XmlWriterSettings settings = new XmlWriterSettings();
settings.Indent = true;
settings.IndentChars = (" ");
settings.Encoding = Encoding.UTF8;
using (XmlWriter writer = XmlWriter.Create(@"C:\output\output.xml", settings))
{
writer.WriteStartDocument();
// TODO - use XMLWriter functions to write out each word and definition
writer.Flush();
}
If you are still short on memory, you might be able to write out the XML in batches (e.g. every 500 definitions).
I found the Microsoft article on Improving XML Performance a very useful reference, particularly the section on Design Considerations.
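One possible way to fill in the TODO above, as a sketch only: the element names (Dictionary, Word, Definition) follow the question's store_xml method, while the WordEntry class, the "text" attribute, and the DictionaryXml.Write helper are made up for this example. The output is opened once and each word is streamed out, instead of reloading and saving the whole document for every word.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical container for one looked-up word
public class WordEntry
{
    public string Word { get; set; }
    public List<string> Definitions { get; set; } = new List<string>();
}

public static class DictionaryXml
{
    // Streams all entries to the output in a single pass.
    public static void Write(TextWriter output, IEnumerable<WordEntry> entries)
    {
        var settings = new XmlWriterSettings { Indent = true };
        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            writer.WriteStartDocument();
            writer.WriteStartElement("Dictionary");
            foreach (var entry in entries)
            {
                writer.WriteStartElement("Word");
                // "text" is an assumed attribute name for the word itself
                writer.WriteAttributeString("text", entry.Word);
                foreach (var definition in entry.Definitions)
                {
                    writer.WriteElementString("Definition", definition);
                }
                writer.WriteEndElement(); // </Word>
            }
            writer.WriteEndElement(); // </Dictionary>
            writer.WriteEndDocument();
        }
    }
}
```

To write to disk, pass a StreamWriter for the target file. Because entries is an IEnumerable, it can also be a lazy sequence that looks up each word as it is written, which gives the batching effect without holding everything in memory.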

C# Read XML files in the Resources folder

I'm trying to read some xml files which I have included in the Resources folder under my project. Below is my code:
public void ReadXMLFile(int TFType)
{
XmlTextReader reader = null;
if (TFType == 1)
reader = new XmlTextReader(MyProject.Properties.Resources.ID01);
else if (TFType == 2)
reader = new XmlTextReader(MyProject.Properties.Resources.ID02);
while (reader.Read())
{
if (reader.IsStartElement())
{
switch (reader.Name)
{
case "Number":
// more coding on the cases.
}
But when I compile, there's an error on "QP2020E.Properties.Resources.ID01" saying: 'Illegal characters in path.' Do you guys know what's wrong?

The XmlTextReader constructor accepts a stream, a TextReader, or a string. The overload that takes a string expects a URL (or path), but you are passing it the content of your resource. You'll need to wrap the string value in a TextReader instead.
To do this, wrap it in a StringReader(...):
reader = new XmlTextReader(new StringReader(MyProject.Properties.Resources.ID02));

The XmlTextReader(string) constructor expects a file path, not the file content. For instance, change
reader = new XmlTextReader(MyProject.Properties.Resources.ID01);
to:
StringReader s = new StringReader(MyProject.Properties.Resources.ID01);
XmlTextReader r = new XmlTextReader(s);

To read an XML file from a resource, use XDocument.Parse as described in this answer
I think you need to modify your code to be like this:
public void ReadXMLFile(int TFType)
{
XDocument doc = null;
if (TFType == 1)
doc = XDocument.Parse(MyProject.Properties.Resources.ID01);
else if (TFType == 2)
doc = XDocument.Parse(MyProject.Properties.Resources.ID02);
// Now use 'doc' as an XDocument object
}
More info on XDocument is here.

Read a xml element and write back the new value of element to xml in C#

I am trying to read abc.xml which has this element
<RunTimeStamp>
9/22/2011 2:58:34 PM
</RunTimeStamp>
I am trying to read the value of this element from the XML file and store it in a string. Once I am done with the processing, I get the current timestamp and write the new timestamp back to the XML file.
Here's my code so far, please help and guide, your help will be appreciated.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using log4net;
using System.Xml;
namespace TestApp
{
class TestApp
{
static void Main(string[] args)
{
Console.WriteLine("\n--- Starting the App --");
XmlTextReader reader = new XmlTextReader("abc.xml");
String DateVar = null;
while (reader.Read())
{
switch (reader.NodeType)
{
case XmlNodeType.Element: // The node is an element.
Console.Write("<" + reader.Name);
Console.WriteLine(">");
if(reader.Name.Equals("RunTimeStamp"))
{
DateVar = reader.Value;
}
break;
case XmlNodeType.Text: //Display the text in each element.
Console.WriteLine(reader.Value);
break;
/*
case XmlNodeType.EndElement: //Display the end of the element.
Console.Write("</" + reader.Name);
Console.WriteLine(">");
break;
*/
}
}
Console.ReadLine();
// after done with the processing.
XmlTextWriter writer = new XmlTextWriter("abc.xml", null);
}
}
}

I personally wouldn't use XmlReader etc here. I'd just load the whole file, preferably with LINQ to XML:
XDocument doc = XDocument.Load("abc.xml");
XElement timestampElement = doc.Descendants("RunTimeStamp").First();
string value = (string) timestampElement;
// Then later...
timestampElement.Value = newValue;
doc.Save("abc.xml");
Much simpler!
Note that if the value is an XML-format date/time, you can cast to DateTime instead:
DateTime value = (DateTime) timestampElement;
then later:
timestampElement.SetValue(DateTime.UtcNow); // Or whatever; Value is a string, so use SetValue for a DateTime
However, that will only handle valid XML date/time formats - otherwise you'll need to use DateTime.TryParseExact etc.
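For the value shown in the question ("9/22/2011 2:58:34 PM"), which is not an XML-format date, a sketch with DateTime.TryParseExact might look like this. The format string and the use of the invariant culture are assumptions based on that single sample value, and TimestampExample and its methods are hypothetical names.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Xml.Linq;

public static class TimestampExample
{
    // Assumed format, inferred from "9/22/2011 2:58:34 PM"
    private const string Format = "M/d/yyyy h:mm:ss tt";

    // Returns the parsed timestamp, or null if the element is missing
    // or its text does not match the assumed format.
    public static DateTime? Read(XDocument doc)
    {
        XElement element = doc.Descendants("RunTimeStamp").FirstOrDefault();
        if (element == null)
        {
            return null;
        }
        DateTime parsed;
        // Trim, because the element text is surrounded by line breaks
        bool ok = DateTime.TryParseExact(
            element.Value.Trim(), Format, CultureInfo.InvariantCulture,
            DateTimeStyles.None, out parsed);
        return ok ? parsed : (DateTime?)null;
    }

    // Writes the timestamp back in the same format.
    public static void Write(XDocument doc, DateTime value)
    {
        XElement element = doc.Descendants("RunTimeStamp").First();
        element.Value = value.ToString(Format, CultureInfo.InvariantCulture);
    }
}
```

Typical use would be XDocument.Load("abc.xml"), then Read, the processing, Write with the current time, and finally doc.Save("abc.xml").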

LINQ to XML is the best way to do it. It is much simpler and easier, as shown by @Jon.
