Currently I'm using the following code snippet to convert a .txt file with XML data to .CSV format. My question is this, currently this works perfectly with files that are around 100-200 mbs and the conversion time is very low (1-2 minutes max), However I now need this to work for much bigger files (1-2 GB's each file). Currently the program freezes the computer and the conversion takes about 30-40 minutes with this function. Not sure how I would proceed changing this function. Any help will be appreciated!
string all_lines = File.ReadAllText(p);
all_lines = "<Root>" + all_lines + "</Root>";
XmlDocument doc_all = new XmlDocument();
doc_all.LoadXml(all_lines);
StreamWriter write_all = new StreamWriter(FILENAME1);
XmlNodeList rows_all = doc_all.GetElementsByTagName("XML");
foreach (XmlNode rowtemp in rows_all)
{
List<string> children_all = new List<string>();
foreach (XmlNode childtemp in rowtemp.ChildNodes)
{
children_all.Add(Regex.Replace(childtemp.InnerText, "\\s+", " "));
}
write_all.WriteLine(string.Join(",", children_all.ToArray()));
}
write_all.Flush();
write_all.Close();
Sample Input::
<XML><DSTATUS>1,4,7,,5</DSTATUS><EVENT> hello,there,my,name,is,jack,</EVENT>
last,name,missing,above <ANOTHERTAG>3,6,7,,8,4</ANOTHERTAG> </XML>
<XML><DSTATUS>1,5,7,,3</DSTATUS><EVENT>hello,there,my,name,is,mary,jane</EVENT>
last,name,not,missing,above<ANOTHERTAG>3,6,7,,8,4</ANOTHERTAG></XML>
Sample Output::
1,4,7,,5,hello,there,my,name,is,jack,,last,name,missing,above,3,6,7,,8,4
1,5,7,,3,hello,there,my,name,is,mary,jane,last,name,not,missing,above,3,6,7,,8,4
You need to take a streaming approach, as you're currently reading the entire 2Gb file into memory and then processing it. You should read a bit of XML, write a bit of CSV and keep doing that until you've processed it all.
A possible solution is below:
using (var writer = new StreamWriter(FILENAME1))
{
foreach (var element in StreamElements(r, "XML"))
{
var values = element.DescendantNodes()
.OfType<XText>()
.Select(e => Regex.Replace(e.Value, "\\s+", " "));
var line = string.Join(",", values);
writer.WriteLine(line);
}
}
Where StreamElements is inspired by Jon Skeet's streaming of XElements from an XmlReader in an answer to this question. I've made some changes to support your 'invalid' XML (as you have no root element):
private static IEnumerable<XElement> StreamElements(string fileName, string elementName)
{
var settings = new XmlReaderSettings
{
ConformanceLevel = ConformanceLevel.Fragment
};
using (XmlReader reader = XmlReader.Create(fileName, settings))
{
while (reader.Read())
{
if (reader.NodeType == XmlNodeType.Element)
{
if (reader.Name == elementName)
{
var el = XNode.ReadFrom(reader) as XElement;
if (el != null)
{
yield return el;
}
}
}
}
}
}
If you're prepared to consider a completely different way of doing it, download Saxon-EE 9.6, get an evaluation license, and run the following streaming XSLT 3.0 code:
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template name="main">
<xsl:stream href="input.xml">
<xsl:for-each select="*/*">
<xsl:value-of select="*!normalize-space()" separator=","/>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:stream>
</xsl:template>
</xsl:stylesheet>
It freezes because of File.ReadAllText(p);
Do not read the complete file into memory. (This will first start swapping, then halt your CPU because no more memory is available)
Use a chunking approach: Read line by line, convert line by line, write line by line.
Use some lower level XML Reader class, not XmlDocument
There are two variants. First is to hide program-freeze, use BackgroundWorker for it.
Second: read your text file string-by-string, use any Reader for it (Xml or any text\file).
You can combine these variants.
Related
I have an Xml file as:
<?xml version="1.0"?>
<hashnotes>
<hashtags>
<hashtag>#birthday</hashtag>
<hashtag>#meeting</hashtag>
<hashtag>#anniversary</hashtag>
</hashtags>
<lastid>0</lastid>
<Settings>
<Font>Arial</Font>
<HashtagColor>red</HashtagColor>
<passwordset>0</passwordset>
<password></password>
</Settings>
</hashnotes>
I then call a function to add a node in the xml,
The function is :
public static void CreateNoteNodeInXDocument(XDocument argXmlDoc, string argNoteText)
{
string lastId=((Convert.ToInt32(argXmlDoc.Root.Element("lastid").Value)) +1).ToString();
string date = DateTime.Now.ToString("MM/dd/yyyy");
argXmlDoc.Element("hashnotes").Add(new XElement("Note", new XAttribute("ID", lastId), new XAttribute("Date",date),new XElement("Text", argNoteText)));
//argXmlDoc.Root.Note.Add new XElement("Text", argNoteText)
List<string> hashtagList = Utilities.GetHashtagsFromText(argNoteText);
XElement reqNoteElement = (from xml2 in argXmlDoc.Descendants("Note")
where xml2.Attribute("ID").Value == lastId
select xml2).FirstOrDefault();
if (reqNoteElement != null)
{
foreach (string hashTag in hashtagList)
{
reqNoteElement.Add(new XElement("hashtag", hashTag));
}
}
argXmlDoc.Root.Element("lastid").Value = lastId;
}
After this I save the xml.
Next time when I try to load the Xml, it fails with an exception:
System.Xml.XmlException: Unexpected XML declaration. The XML declaration must be the first node in the document, and no white space characters are allowed to appear before it.
Here is the code to load the XML:
private static XDocument hashNotesXDocument;
private static Stream hashNotesStream;
StorageFile hashNoteXml = await InstallationFolder.GetFileAsync("hashnotes.xml");
hashNotesStream = await hashNoteXml.OpenStreamForWriteAsync();
hashNotesXDocument = XDocument.Load(hashNotesStream);
and I save it using:
hashNotesXDocument.Save(hashNotesStream);
You don't show all of your code, but it looks like you open the XML file, read the XML from it into an XDocument, edit the XDocument in memory, then write back to the opened stream. Since the stream is still open it will be positioned at the end of the file and thus the new XML will be appended to the file.
Suggest eliminating hashNotesXDocument and hashNotesStream as static variables, and instead open and read the file, modify the XDocument, then open and write the file using the pattern shown here.
I'm working only on desktop code (using an older version of .Net) so I can't test this, but something like the following should work:
static async Task LoadUpdateAndSaveXml(Action<XDocument> editor)
{
XDocument doc;
var xmlFile = await InstallationFolder.GetFileAsync("hashnotes.xml");
using (var reader = new StreamReader(await xmlFile.OpenStreamForReadAsync()))
{
doc = XDocument.Load(reader);
}
if (doc != null)
{
editor(doc);
using (var writer = new StreamWriter(await xmlFile.OpenStreamForWriteAsync()))
{
// Truncate - https://stackoverflow.com/questions/13454584/writing-a-shorter-stream-to-a-storagefile
if (writer.CanSeek && writer.Length > 0)
writer.SetLength(0);
doc.Save(writer);
}
}
}
Also, be sure to create the file before using it.
I would like a C# code that optimally appends 2 XML strings. Both of them are of same schema. I tried StreamReader / StreamWriter; File.WriteAllText; FileStream
The problem I see is, it uses more than 98% of physical memory thus results in out of memory exception.
Is there a way to optimally merge without getting any memory exceptions? Time is not a concern for me.
If making it available in memory is going to be a problem, then what else could be better? Saving it on File system?
Further Details:
Here is my simple program: to provide better detail
static void Main(string[] args)
{
Program p = new Program();
XmlDocument x1 = new XmlDocument();
XmlDocument x2 = new XmlDocument();
x1.Load("C:\\XMLFiles\\1.xml");
x2.Load("C:\\XMLFiles\\2.xml");
List<string> files = new List<string>();
files.Add("C:\\XMLFiles\\1.xml");
files.Add("C:\\XMLFiles\\2.xml");
p.ConsolidateFiles(files, "C:\\XMLFiles\\Result.xml");
p.MergeFiles("C:\\XMLFiles\\Result.xml", x1.OuterXml, x2.OuterXml, "<Data>", "</Data>");
Console.ReadLine();
}
public void ConsolidateFiles(List<String> files, string outputFile)
{
var output = new StreamWriter(File.Open(outputFile, FileMode.Create));
output.WriteLine("<Data>");
foreach (var file in files)
{
var input = new StreamReader(File.Open(file, FileMode.Open));
string line;
while (!input.EndOfStream)
{
line = input.ReadLine();
if (!line.Contains("<Data>") &&
!line.Contains("</Data>"))
{
output.Write(line);
}
}
}
output.WriteLine("</Data>");
}
public void MergeFiles(string outputPath, string xmlState, string xmlFederal, string prefix, string suffix)
{
File.WriteAllText(outputPath, prefix);
File.AppendAllText(outputPath, xmlState);
File.AppendAllText(outputPath, xmlFederal);
File.AppendAllText(outputPath, suffix);
}
XML Sample:
<Data> </Data> is appended at the beginning & End
XML 1: <Sections> <Section></Section> </Sections>
XML 2: <Sections> <Section></Section> </Sections>
Merged: <Data> <Sections> <Section></Section> </Sections> <Sections> <Section></Section> </Sections> </Data>
Try this, a stream based approach which avoids loading all the xml into memory at once.
static void Main(string[] args)
{
List<string> files = new List<string>();
files.Add("C:\\XMLFiles\\1.xml");
files.Add("C:\\XMLFiles\\2.xml");
ConsolidateFiles(files, "C:\\XMLFiles\\Result.xml");
Console.ReadLine();
}
private static void ConsolidateFiles(List<String> files, string outputFile)
{
using (var output = new StreamWriter(outputFile))
{
output.WriteLine("<Data>");
foreach (var file in files)
{
using (var input = new StreamReader(file, FileMode.Open))
{
while (!input.EndOfStream)
{
string line = input.ReadLine();
if (!line.Contains("<Data>") &&
!line.Contains("</Data>"))
{
output.Write(line);
}
}
}
}
output.WriteLine("</Data>");
}
}
An even better approach is to use XmlReader (http://msdn.microsoft.com/en-us/library/system.xml.xmlreader(v=vs.90).aspx). This will give you a stream reader designed specifically for xml, rather than StreamReader which is for reading general text.
Take a look here
The answer given by Teoman Soygul seems to be what you're looking for.
This is untested, but I would do something along these lines using TextReader and TextWriter. You do not want to read all of the XML text into memory or store it in a string, and you do not want to use XElement/XDocument/etc. anywhere in the middle.
using (var writer = new XmlTextWriter("ResultFile.xml")
{
writer.WriteStartDocument();
writer.WriteStartElement("Data");
using (var reader = new XmlTextReader("XmlFile1.xml")
{
reader.Read();
while (reader.Read())
{
writer.WriteNode(reader, true);
}
}
using (var reader = new XmlTextReader("XmlFile2.xml")
{
reader.Read();
while (reader.Read())
{
writer.WriteNode(reader, true);
}
}
writer.WriteEndElement("Data");
}
Again no guarantees that this exact code will work as-is (or that it even compiles), but I think that is the idea you're looking for. Stream data from File1 first and write it directly out to the result file. Then, stream data from File2 and write it out. At no point should a full XML file be in memory.
If you run on 64bit, try this: go to your project properties -> build tab -> Platform target: change "Any CPU" to "x64".
This solved my problem for loading huge XML files in memory.
you have to go to file system, unless you have lots of RAM
one simple approach:
File.WriteAllText("output.xml", "<Data>");
File.AppendAllText("output.xml", File.ReadAllText("xml1.xml"));
File.AppendAllText("output.xml", File.ReadAllText("xml2.xml"));
File.AppendAllText("output.xml", "</Data>");
another:
var fNames = new[] { "xml1.xml", "xml2.xml" };
string line;
using (var writer = new StreamWriter("output.xml"))
{
writer.WriteLine("<Data>");
foreach (var fName in fNames)
{
using (var file = new System.IO.StreamReader(fName))
{
while ((line = file.ReadLine()) != null)
{
writer.WriteLine(line);
}
}
}
writer.WriteLine("</Data>");
}
All of this with the premise that there is not schema, or tags inside xml1.xml and xml2.xml
If that is the case, just code to omit them.
I have an XML file with no root. I cannot change this. I am trying to parse it, but XDocument.Load won't do it. I have tried to set ConformanceLevel.Fragment, but I still get an exception thrown. Does anyone have a solution to this?
I tried with XmlReader, but things are messed up and can't get it work right. XDocument.Load works great, but if I have a file with multiple roots, it doesn't.
XmlReader itself does support reading of xml fragment - i.e.
var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
using (var reader = XmlReader.Create("fragment.xml", settings))
{
// you can work with reader just fine
}
However XDocument.Load does not support reading of fragmented xml.
Quick and dirty way is to wrap the nodes under one virtual root before you invoke the XDocument.Parse. Like:
var fragments = File.ReadAllText("fragment.xml");
var myRootedXml = "<root>" + fragments + "</root>";
var doc = XDocument.Parse(myRootedXml);
This approach is limited to small xml files - as you have to read file into memory first; and concatenating large string means moving large objects in memory - which is best avoided.
If performance matters you should be reading nodes into XDocument one-by-one via XmlReader as explained in excellent #Martin-Honnen 's answer (https://stackoverflow.com/a/18203952/2440262)
If you use API that takes for granted that XmlReader iterates over valid xml, and performance matters, you can use joined-stream approach instead:
using (var jointStream = new MultiStream())
using (var openTagStream = new MemoryStream(Encoding.ASCII.GetBytes("<root>"), false))
using (var fileStream =
File.Open(#"fragment.xml", FileMode.Open, FileAccess.Read, FileShare.Read))
using (var closeTagStream = new MemoryStream(Encoding.ASCII.GetBytes("</root>"), false))
{
jointStream.AddStream(openTagStream);
jointStream.AddStream(fileStream);
jointStream.AddStream(closeTagStream);
using (var reader = XmlReader.Create(jointStream))
{
// now you can work with reader as if it is reading valid xml
}
}
MultiStream - see for example https://gist.github.com/svejdo1/b9165192d313ed0129a679c927379685
Note: XDocument loads the whole xml into memory. So don't use it for large files - instead use XmlReader for iteration and load just the crispy bits as XElement via XNode.ReadFrom(...)
The only in-memory tree representations in the .NET framework that can deal with fragments are the XmlDocumentFragment in .NET's DOM implementation so you would need to create an XmlDocument and a fragment with e.g.
XmlDocument doc = new XmlDocument();
XmlDocumentFragment frag = doc.CreateDocumentFragment();
frag.InnerXml = stringWithXml; // for instance
// frag.InnerXml = File.ReadAllText("fragment.xml");
or is XPathDocument where you can create one using an XmlReader with ConformanceLevel set to Fragment:
XPathDocument doc;
using (XmlReader xr =
XmlReader.Create("fragment.xml",
new XmlReaderSettings()
{
ConformanceLevel = ConformanceLevel.Fragment
}))
{
doc = new XPathDocument(xr);
}
// new create XPathNavigator for read out data e.g.
XPathNavigator nav = doc.CreateNavigator();
Obviously XPathNavigator is read-only.
If you want to use LINQ to XML then I agree with the suggestions made that you need to create an XElement as a wrapper. Instead of pulling in a string with the file contents you could however use XNode.ReadFrom with an XmlReader e.g.
public static class MyExtensions
{
public static IEnumerable<XNode> ParseFragment(XmlReader xr)
{
xr.MoveToContent();
XNode node;
while (!xr.EOF && (node = XNode.ReadFrom(xr)) != null)
{
yield return node;
}
}
}
then
XElement root = new XElement("root",
MyExtensions.ParseFragment(XmlReader.Create(
"fragment.xml",
new XmlReaderSettings() {
ConformanceLevel = ConformanceLevel.Fragment })));
That might work better and more efficiently than reading everything into a string.
If you wanted to use XmlDocument.Load() then you would need to wrap the content in a root node.
or you could try something like this...
while (xmlReader.Read())
{
if (xmlReader.NodeType == XmlNodeType.Element)
{
XmlDocument d = new XmlDocument();
d.CreateElement().InnerText = xmlReader.ReadOuterXml();
}
}
XML document cannot have more than one root elements. One root element is required. You may do one thing. Get all the fragment elements and wrap them into a root element and parse it with XDocument.
This would be the best and easiest approach that one could think of.
I've been having problems with the indentation of my XML files. Everytime I load them from a certain server, the XML nodes all jumble up on a few lines. I want to write a quick application to indent the nodes properly. That is:
<name>Bob<name>
<age>24</age>
<address>
<stnum>2</stnum>
<street>herp derp st</street>
</address>
currently it's coming out as :
<name>bob</name><age>24</age>
<address>
<stnum>2</stnum><street>herp derp st</street>
</address>
since I can't touch the internal program that gives me these xml files and re-indenting them without a program would take ages, I wanted to write up a quick program to do this for me. When I use the XMLdocument library stuff, it only reads the information of the nodes. So my question is, whats a good way to read the file, line by line and then reindenting it for me. All xml nodes are the same.
Thanks.
You can use the XmlTextWritter class. More specifically the .Formatting = Formatting.Indented.
Here is some sample code I found on this blog post.
http://www.yetanotherchris.me/home/2009/9/9/formatting-xml-in-c.html
public static string FormatXml(string inputXml)
{
XmlDocument document = new XmlDocument();
document.Load(new StringReader(inputXml));
StringBuilder builder = new StringBuilder();
using (XmlTextWriter writer = new XmlTextWriter(new StringWriter(builder)))
{
writer.Formatting = Formatting.Indented;
document.Save(writer);
}
return builder.ToString();
}
With LINQ to XML, it's basically a one-liner:
public static string Reformat(string xml)
{
return XDocument.Parse(xml).ToString();
}
Visual Studio or any decent XML editor will format (tabify) XML documents easily. There are also on-line tools available:
http://www.xmlformatter.net/
http://www.shell-tools.net/index.php?op=xml_format
If you are using Visual studio just open xml do Ctrl+a Ctrl+k Ctrl+F and that's it for formatting.
You can also use XSLT:
// This XSLT copies everything but idented
StringReader sr = new StringReader( xsl );
XmlReader reader = XmlReader.Create(sr);
XslTransform xslt = new XslTransform();
xslt.Load(reader);
xslt.Transform(xmlFileUnidentedPath, xmlFileIdentedPath);
Having xsl defined as:
string xsl = #"
<?xml version=""1.0""?>
<xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
<xsl:output method=""xml"" omit-xml-declaration=""no"" indent=""yes"" encoding=""US-SCII""/>
<xsl:strip-space elements=""*""/>
<xsl:template match=""/"">
<xsl:copy-of select="".""/>
</xsl:template>
</xsl:stylesheet>";
I've written a Task Scheduling program for learning purposes. Currently I'm saving the scheduled tasks just as plain text and then parsing it using Regex. This looks messy (code wise) and is not very coherent.
I would like to load the scheduled tasks from an XML file instead, I've searched quite a bit to find some solutions but I couldn't get it to work how I wanted.
I wrote an XML file structured like this to store my data in:
<Tasks>
<Task>
<Name>Shutdown</Name>
<Location>C:/WINDOWS/system32/shutdown.exe</Location>
<Arguments>-s -f -t 30</Arguments>
<RunWhen>
<Time>8:00:00 a.m.</Time>
<Date>18/03/2011</Date>
<Days>
<Monday>false</Monday>
<Tuesday>false</Tuesday>
<Wednesday>false</Wednesday>
<Thursday>false</Thursday>
<Friday>false</Friday>
<Saturday>false</Saturday>
<Sunday>false</Sunday>
<Everyday>true</Everyday>
<RunOnce>false</RunOnce>
</Days>
</RunWhen>
<Enabled>true</Enabled>
</Task>
</Tasks>
The way I'd like to parse the data is like so:
Open Tasks.xml
Load the first Task tag.
In that task retrieve the values of the Name, Location and Arguments tags.
Then open the RunWhen tag and retrieve the values of the Time and Date tags.
After that open the Days tag and retrieve the value of each individual tag within.
Retrieve the value of Enabled.
Load the next task and repeat steps 3 -> 7 until all the Task tags in Tasks have been parsed.
I'm very sure you can do it this way I just can't work it out as there are so many different ways to do things in XML I got a bit overwhelmed. But what I've go so far is that I would most likely be using XPathDocument and XPathNodeIterator right?
If someone can show me an example or explain to me how this would be done I would be very happy.
Easy way to parse the xml is to use the LINQ to XML
for example you have the following xml file
<library>
<track id="1" genre="Rap" time="3:24">
<name>Who We Be RMX (feat. 2Pac)</name>
<artist>DMX</artist>
<album>The Dogz Mixtape: Who's Next?!</album>
</track>
<track id="2" genre="Rap" time="5:06">
<name>Angel (ft. Regina Bell)</name>
<artist>DMX</artist>
<album>...And Then There Was X</album>
</track>
<track id="3" genre="Break Beat" time="6:16">
<name>Dreaming Your Dreams</name>
<artist>Hybrid</artist>
<album>Wide Angle</album>
</track>
<track id="4" genre="Break Beat" time="9:38">
<name>Finished Symphony</name>
<artist>Hybrid</artist>
<album>Wide Angle</album>
</track>
<library>
For reading this file, you can use the following code:
public void Read(string fileName)
{
XDocument doc = XDocument.Load(fileName);
foreach (XElement el in doc.Root.Elements())
{
Console.WriteLine("{0} {1}", el.Name, el.Attribute("id").Value);
Console.WriteLine(" Attributes:");
foreach (XAttribute attr in el.Attributes())
Console.WriteLine(" {0}", attr);
Console.WriteLine(" Elements:");
foreach (XElement element in el.Elements())
Console.WriteLine(" {0}: {1}", element.Name, element.Value);
}
}
I usually use XmlDocument for this. The interface is pretty straight forward:
var doc = new XmlDocument();
doc.LoadXml(xmlString);
You can access nodes similar to a dictionary:
var tasks = doc["Tasks"];
and loop over all children of a node.
Try XmlSerialization
try this
[Serializable]
public class Task
{
public string Name{get; set;}
public string Location {get; set;}
public string Arguments {get; set;}
public DateTime RunWhen {get; set;}
}
public void WriteXMl(Task task)
{
XmlSerializer serializer;
serializer = new XmlSerializer(typeof(Task));
MemoryStream stream = new MemoryStream();
StreamWriter writer = new StreamWriter(stream, Encoding.Unicode);
serializer.Serialize(writer, task);
int count = (int)stream.Length;
byte[] arr = new byte[count];
stream.Seek(0, SeekOrigin.Begin);
stream.Read(arr, 0, count);
using (BinaryWriter binWriter=new BinaryWriter(File.Open(#"C:\Temp\Task.xml", FileMode.Create)))
{
binWriter.Write(arr);
}
}
public Task GetTask()
{
StreamReader stream = new StreamReader(#"C:\Temp\Task.xml", Encoding.Unicode);
return (Task)serializer.Deserialize(stream);
}
Are you familiar with the DataSet class?
The DataSet can also load XML documents and you may find it easier to iterate.
http://msdn.microsoft.com/en-us/library/system.data.dataset.readxml.aspx
DataSet dt = new DataSet();
dt.ReadXml(#"c:\test.xml");
class Program
{
static void Main(string[] args)
{
//Load XML from local
string sourceFileName="";
string element=string.Empty;
var FolderPath=#"D:\Test\RenameFileWithXmlAttribute";
string[] files = Directory.GetFiles(FolderPath, "*.xml");
foreach (string xmlfile in files)
{
try
{
sourceFileName = xmlfile;
XElement xele = XElement.Load(sourceFileName);
string convertToString = xele.ToString();
XElement parseXML = XElement.Parse(convertToString);
element = parseXML.Descendants("Meta").Where(x => (string)x.Attribute("name") == "XMLTAG").Last().Value;
DirectoryInfo CurrentDate = Directory.CreateDirectory(DateTime.Now.ToString("yyyy-MM-dd"));
string saveWithThisName= Path.Combine(CurrentDate.FullName, element);
File.Copy(sourceFileName, saveWithThisName,true);
}
catch(Exception ex)
{
}
}
}
}