Splitting a large XML file in two using C# console app

I need to split an XML file (~400 MB) in two so that a legacy app can process it. At the moment it's throwing an exception when the file is over roughly 300 MB.
As I can't change the app which is doing the processing, I thought I could write a console app to split the file in two first. What's the best way of doing this? It needs to be automated so I can't use a text editor, and I'm using C#.
I suppose the considerations are:
writing a header to the new files after the split
finding a good place to split (not in middle of 'object')
closing off tags and file correctly in first file, opening tags correctly in second file
Any suggestions?

The "best" way is likely to be based on XmlReader and XmlWriter. Using these "streaming" APIs avoids needing to load the whole XML object model in memory (and with DOM –XmlDocument– that can need considerably more memory than the text data).
Using these APIs is harder than just loading the document: your implementation needs to track the context (eg. current node and ancestor list), but in this case that wouldn't be complex (just enough to open the elements to the current state when opening each output document).
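As a rough illustration of the streaming approach described above, here is a minimal sketch. It assumes the file has a single root element whose direct child elements are the 'objects' to divide, and that any text sitting directly under the root can be dropped. The class name XmlSplitter and the method names are made up for this example. It reads the input twice: once to count the children, once to copy them.

```csharp
using System.Xml;

public static class XmlSplitter
{
    // First pass: count the element children of the root without building a DOM.
    static int CountChildren(string path)
    {
        int count = 0;
        using (XmlReader reader = XmlReader.Create(path))
        {
            reader.MoveToContent();                       // position on the root element
            if (reader.IsEmptyElement) return 0;
            reader.Read();                                // step inside the root
            while (reader.NodeType != XmlNodeType.EndElement && !reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element) { count++; reader.Skip(); }
                else reader.Read();
            }
        }
        return count;
    }

    // Second pass: copy the first half of the children to one file, the rest to another.
    public static void Split(string inputPath, string firstPath, string secondPath)
    {
        int half = (CountChildren(inputPath) + 1) / 2;
        using (XmlReader reader = XmlReader.Create(inputPath))
        using (XmlWriter first = XmlWriter.Create(firstPath))
        using (XmlWriter second = XmlWriter.Create(secondPath))
        {
            reader.MoveToContent();
            // Re-create the root element, with its attributes, in both outputs.
            foreach (XmlWriter w in new[] { first, second })
            {
                w.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
                w.WriteAttributes(reader, true);
            }
            if (!reader.IsEmptyElement)
            {
                reader.Read();
                int written = 0;
                while (reader.NodeType != XmlNodeType.EndElement && !reader.EOF)
                {
                    if (reader.NodeType == XmlNodeType.Element)
                    {
                        XmlWriter target = written < half ? first : second;
                        target.WriteNode(reader, true);   // copies the subtree and moves past it
                        written++;
                    }
                    else
                    {
                        reader.Read();                    // skip whitespace and comments between children
                    }
                }
            }
            first.WriteEndElement();
            second.WriteEndElement();
        }
    }
}
```

Calling XmlSplitter.Split("big.xml", "part1.xml", "part2.xml") (file names are placeholders) would give two well-formed files that share the original root element. Only one child subtree is held in memory at a time, which is the point of using the streaming APIs here.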

You might want to consider making a full copy of the file and then deleting elements from each. You will have to decide at what level the deletions could occur.
It should then be fairly straightforward, from a count of how many elements have been deleted from FileA, to identify how many (and from what starting point) should be deleted from FileB.
Is that feasible for your circumstance?
I have put together the following to describe my thinking. It is not tested, but I would value the comments of the group. Downvote me if you want but I would prefer constructive criticism.
using System.Xml;

namespace ConsoleApplication1
{
    class Program
    {
        // args[0] and args[1] are two identical copies of the original file.
        static void Main(string[] args)
        {
            SplitXML(args[0], args[1]);
        }

        private static void SplitXML(string fileNameA, string fileNameB)
        {
            // ------------- Process FileA: drop the first half, keep the rest
            XmlDocument doc = new XmlDocument();
            doc.Load(fileNameA);
            XmlNodeList childNodes = doc.DocumentElement.ChildNodes;
            int deleteCount = childNodes.Count / 2;
            for (int i = 0; i < deleteCount; i++)
            {
                doc.DocumentElement.RemoveChild(childNodes.Item(0));
            }
            doc.Save("FileC");

            // ------------- Process FileB: keep the first half, drop the rest
            doc = new XmlDocument();
            doc.Load(fileNameB);
            childNodes = doc.DocumentElement.ChildNodes;
            // ChildNodes is a live list, so loop until only deleteCount nodes remain.
            while (childNodes.Count > deleteCount)
            {
                doc.DocumentElement.RemoveChild(childNodes.Item(deleteCount));
            }
            doc.Save("FileD");
        }
    }
}

If it's pure C#, running it as a 64-bit process might solve the problem for no effort at all (assuming you have a 64-bit Windows at hand).

Related

Inserting word content into a VSTO document level customization

I have a VSTO document level customization that performs specific functionality when opened from within our application. Basically, we open normal documents from inside of our application and I copy the content from the normal docx file into the VSTO document file which is stored inside of our database.
var app = new Microsoft.Office.Interop.Word.Application();
var docs = app.Documents;
var vstoDoc = docs.Open(vstoDocPath);
var doc = docs.Open(currentDocPath);
doc.Range().Copy();
vstoDoc.Range().PasteAndFormat(WdRecoveryType.wdFormatOriginalFormatting);
Everything works great; however, the above code loses some of the document's formatting. The code below fixes the issues I have found so far, but I will most likely come across more and would have to address them one by one ...
for (int i = 0; i < doc.Sections.Count; i++)
{
var footerFont = doc.Sections[i + 1].Footers.GetEnumerator();
var headerFont = doc.Sections[i + 1].Headers.GetEnumerator();
var footNoteFont = doc.Footnotes.GetEnumerator();
foreach (HeaderFooter foot in vstoDoc.Sections[i + 1].Footers)
{
footerFont.MoveNext();
foot.Range.Font.Name = ((HeaderFooter)footerFont.Current).Range.Font.Name;
}
foreach (HeaderFooter head in vstoDoc.Sections[i + 1].Headers)
{
headerFont.MoveNext();
head.Range.Font.Name = ((HeaderFooter)headerFont.Current).Range.Font.Name;
}
foreach (Footnote footNote in vstoDoc.Footnotes)
{
footNoteFont.MoveNext();
footNote.Range.Font.Name = ((Footnote)footNoteFont.Current).Range.Font.Name;
}
}
I need a foolproof, safe way of copying the content of one docx file into another while preserving formatting and without risking corruption of the document. I've tried using reflection to copy property values from one document to the other, but the code starts to look ugly and I always worry that some of the properties I'm setting may have undesirable side effects. I've also tried unzipping the docx files, editing the XML manually and rezipping afterwards. This hasn't worked well: I ended up corrupting a few documents in the process.
If anyone has dealt with a similar issue in the past, please could you point me in the right direction.
Thank you for your time

This code copies and keeps source formatting.
bookmark.Range.Copy();
Document newDocument = WordInstance.Documents.Add();
newDocument.Activate();
newDocument.Application.CommandBars.ExecuteMso("PasteSourceFormatting");

There is a more elegant way to manage it, based on ImportFragment:
Globals.ThisAddIn.Application.ActiveDocument.Range().ImportFragment(filePath);
or, to import at the current selection's range:
Globals.ThisAddIn.Application.Selection.Range.ImportFragment(filePath);
where filePath is the path to the document you are copying from.

Read entire elements from an XML network stream

I am writing a network server in C# .NET 4.0. There is a network TCP/IP connection over which I can receive complete XML elements. They arrive regularly and I need to process them immediately. Each XML element is a complete XML document in itself, so it has an opening element, several sub-nodes and a closing element. There is no single root element for the entire stream. So when I open the connection, what I get is like this:
<status>
<x>123</x>
<y>456</y>
</status>
Then some time later it continues:
<status>
<x>234</x>
<y>567</y>
</status>
And so on. I need a way to read the complete XML string until a status element is complete. I don't want to do that with plain text reading methods because I don't know in what formatting the data arrives. I can in no way wait until the entire stream is finished, as is often described elsewhere. I have tried using the XmlReader class but its documentation is weird, the methods don't work out, the first element is lost and after sending the second element, an XmlException occurs because there are two root elements.

Try this:
var settings = new XmlReaderSettings
{
    ConformanceLevel = ConformanceLevel.Fragment
};
using (var reader = XmlReader.Create(stream, settings))
{
    // MoveToContent skips whitespace between fragments and
    // returns XmlNodeType.None once the stream has ended.
    while (reader.MoveToContent() == XmlNodeType.Element)
    {
        var doc = XDocument.Load(reader.ReadSubtree());
        Console.WriteLine("X={0}, Y={1}",
            (int)doc.Root.Element("x"),
            (int)doc.Root.Element("y"));
        // Step past the closing tag of the element just read.
        reader.Read();
    }
}

If you change the "conformance level" to "fragment", it might work with the XmlReader.
This is a (slightly modified) example from MSDN:
XmlReaderSettings settings = new XmlReaderSettings();
settings.ConformanceLevel = ConformanceLevel.Fragment;
XmlReader reader = XmlReader.Create(streamOfXmlFragments, settings);

You could use XElement.Load (new in .NET 3.5), which is better suited to loading XML element fragments and also supports reading directly from a stream.
Have a look at System.Xml.Linq
I think that you may well still have to add some control logic so as to partition the messages you are receiving, but you may as well give it a go.

I'm not sure there's anything built-in that does that.
I'd open a string builder, fill it until I see a </status> tag, and then parse it using the ordinary XmlDocument.

Not substantially different from dtb's solution, but linqier
static IEnumerable<XDocument> GetDocs(Stream xmlStream)
{
var xmlSettings = new XmlReaderSettings() { ConformanceLevel = ConformanceLevel.Fragment };
using (var xmlReader = XmlReader.Create(xmlStream, xmlSettings))
{
var xmlPathNav = new XPathDocument(xmlReader).CreateNavigator();
foreach (var selectee in xmlPathNav.Select("/*").OfType<XPathNavigator>())
yield return XDocument.Load(selectee.ReadSubtree());
}
}
I ran into a similar problem in PowerShell, but the asker's question was in C#, so I've attempted to translate it (and verified that it works). Here is where I found the clue that got me over the last little bumps (". . .The way the XPathDocument does its magic is by creating a “transparent” root node, and holding the fragments from it. I say it’s transparent because your XPath queries can use the root node axis and still get properly resolved to the fragments. . .")
The fragments of XML I'm working with happen to be smallish. If you had bigger chunks, you'd probably want to look into XStreamingElement - it can add a lot of complexity but also greatly decrease memory usage when dealing with large volumes of XML.

Using the SharpSVN api are there any methods available to get the number of lines contained in a file at a Revision without Exporting it?

I was just wondering if I missed anything inside the documentation that would allow me to get the number of lines contained in a file at a certain revision (or even number of lines changed from a SvnChangeItem, that would be nice too) without having to directly export the file to the filesystem and parse through it counting each line.
Any help would be appreciated. Thanks.

Nope, you're stuck with exactly the solution you named: export to a temp file, count the lines, delete the file. That is a fairly expensive operation if you're doing it file by file. If you need to line-count every file, it may be better to fetch the entire repo and reuse the working directory for future runs.

The meta data (like current line count) is not contained within the repository but you can get the file without doing messy temp files.
For brevity, excluded code to iterate over revisions etc.
using (var client = new SvnClient())
{
using (MemoryStream memoryStream = new MemoryStream())
{
client.Write(new SvnUriTarget(urlToFile), memoryStream);
memoryStream.Position = 0;
var streamReader = new StreamReader(memoryStream);
int lineCount = 0;
while (streamReader.ReadLine() != null)
{
lineCount++;
}
}
}

xml and files on disc interaction

It's been a while since I've needed to do this, so I was looking at the old-school methods of writing an XmlDocument from code down to a file.
In my application I am writing a lot to an XmlDocument, adding new elements and values, periodically saving it to disk, and also reading from the file and acting on the data.
I am using methods like File.Exists(...), _xmldoc.LoadFile(..) etc.
I'm wondering whether there are better methods for this nowadays with regard to:
Parsing the XML
Checking its format before saving
Having the saved data treated as XML rather than as text
Maybe what I am doing is fine, but it's been a while and I wondered if there are other methods :)
Thanks

Well there's LINQ to XML which is a really nice XML API introduced in .NET 3.5. I don't think the existing XMLDocument API has changed much, beyond some nicer ways of creating XmlReaders and XmlWriters (XmlReader.Create/XmlWriter.Create).
I'm not sure what you really mean by your second and third bullets though. What exactly are you doing in code which feels awkward?
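For what it's worth, here is a minimal sketch of the load, modify and save cycle with LINQ to XML. The class name, the element names ("entries", "entry") and the attribute name are all invented for the example; only the XDocument, XElement and XAttribute calls come from System.Xml.Linq.

```csharp
using System.IO;
using System.Xml.Linq;

public static class EntryStore
{
    // Loads the document if the file exists, otherwise starts a new one
    // with an empty (hypothetical) <entries> root element.
    public static XDocument LoadOrCreate(string path)
    {
        return File.Exists(path)
            ? XDocument.Load(path)
            : new XDocument(new XElement("entries"));
    }

    // Appends one <entry name="...">value</entry> element and saves the file.
    public static void AddEntry(string path, string name, string value)
    {
        XDocument doc = LoadOrCreate(path);
        doc.Root.Add(new XElement("entry", new XAttribute("name", name), value));
        doc.Save(path);
    }
}
```

XDocument.Load throws an XmlException if the file is not well-formed, and Save always writes well-formed XML with proper escaping, so the data is handled as XML rather than as text.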

Have you looked at the Save method of your XmlDocument? It will save whatever is in your XmlDocument as a valid formatted file.
If your program is able to use the XmlDocument class, the XmlDocument class will be able to save your file. You won't need to worry about validating before saving, and you can give it whatever file extension you want. As to your third point... an XML file is really just a text file. It won't matter how the OS sees it.

I was a big fan of XmlDocument because it is easy to use, but I recently ran into a huge memory problem with that class, so I started to use XmlReader and XmlWriter.
XmlReader can be a little tricky to use if your XML file is complex, because you read the file sequentially. In that case, the ReadSubtree method of XmlReader can be very useful: it returns a reader over only the XML tree under the current node, so you can pass that new XmlReader to a function that parses the subnode content and, once it is done, continue to the next node.
XmlReader Example:
string xmlcontent = "<BigXml/>";
using(StringReader strContent = new StringReader(xmlcontent))
{
using (XmlReader reader = XmlReader.Create(strContent))
{
while (reader.Read())
{
if (reader.Name == "SomeName" && reader.NodeType == XmlNodeType.Element)
{
//Send the XmlReader created by ReadSubTree to a function to read it.
ReadSubContentOfSomeName(reader.ReadSubtree());
}
}
}
}
XmlWriter Example:
StringBuilder builder = new StringBuilder();
using (XmlWriter writer = XmlWriter.Create(builder))
{
writer.WriteStartDocument();
writer.WriteStartElement("BigXml");
writer.WriteAttributeString("someAttribute", "42");
writer.WriteString("Some Inner Text");
//Write nodes under BigXml
writer.WriteStartElement("SomeName");
writer.WriteEndElement();
writer.WriteEndElement();
writer.WriteEndDocument();
}

Fastest way to add new node to end of an xml?

I have a large xml file (approx. 10 MB) in following simple structure:
<Errors>
<Error>.......</Error>
<Error>.......</Error>
<Error>.......</Error>
<Error>.......</Error>
<Error>.......</Error>
</Errors>
I need to add a new <Error> node at the end, just before the </Errors> tag. What is the fastest way to achieve this in .NET?

You need to use the XML inclusion technique.
Your error.xml (doesn't change, just a stub. Used by XML parsers to read):
<?xml version="1.0"?>
<!DOCTYPE logfile [
<!ENTITY logrows
SYSTEM "errorrows.txt">
]>
<Errors>
&logrows;
</Errors>
Your errorrows.txt file (changes, the xml parser doesn't understand it):
<Error>....</Error>
<Error>....</Error>
<Error>....</Error>
Then, to add an entry to errorrows.txt:
using (StreamWriter sw = File.AppendText("errorrows.txt"))
{
    XmlTextWriter xtw = new XmlTextWriter(sw);
    xtw.WriteStartElement("Error");
    // ... write error message here
    xtw.WriteEndElement();
    xtw.Close();
}
Or you can even use the .NET 3.5 XElement and append its text to the StreamWriter:
using (StreamWriter sw = File.AppendText("errorrows.txt"))
{
    XElement element = new XElement("Error");
    // ... write error message here
    sw.WriteLine(element.ToString());
}
See also Microsoft's article Efficient Techniques for Modifying Large XML Files

First, I would disqualify System.Xml.XmlDocument because it is a DOM which requires parsing and building the entire tree in memory before it can be appended to. This means your 10 MB of text will be more than 10 MB in memory. This means it is "memory intensive" and "time consuming".
Second, I would disqualify System.Xml.XmlReader because it requires parsing the entire file first before you can get to the point of when you can append to it. You would have to copy the XmlReader into an XmlWriter since you can't modify it. This requires duplicating your XML in memory first before you can append to it.
A faster alternative to XmlDocument and XmlReader would be string manipulation (which has its own memory issues):
string xml = @"<Errors><error />...<error /></Errors>";
int idx = xml.LastIndexOf("</Errors>");
xml = xml.Substring(0, idx) + "<error>new error</error></Errors>";
Chop off the end tag, add in the new error, and add the end tag back.
I suppose you could go crazy with this and truncate your file by 9 characters and append to it. Wouldn't have to read in the file and would let the OS optimize page loading (only would have to load in the last block or something).
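using (System.IO.FileStream fs = System.IO.File.Open("log.xml", System.IO.FileMode.Open, System.IO.FileAccess.ReadWrite))
{
    // Assumes the file is ASCII/UTF-8 and ends exactly with "</Errors>".
    fs.Seek(-"</Errors>".Length, System.IO.SeekOrigin.End);
    // FileStream writes bytes, so encode the text first.
    byte[] bytes = System.Text.Encoding.UTF8.GetBytes("<error>new error</error></Errors>");
    fs.Write(bytes, 0, bytes.Length);
}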
That will hit a problem if your file is empty or contains only "<Errors></Errors>", both of which can easily be handled by checking the length.

The fastest way would probably be a direct file access.
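// A stream opened in append mode cannot seek back into existing data,
// so open the file for read/write instead of using File.AppendText.
using (FileStream stream = new FileStream("my.log", FileMode.Open, FileAccess.ReadWrite))
using (StreamWriter file = new StreamWriter(stream))
{
    stream.Seek(-"</Errors>".Length, SeekOrigin.End);
    file.Write("  <Error>New error message.</Error></Errors>");
}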
But you lose all the nice XML features and may easily corrupt the file.

I would use XmlDocument or XDocument to Load your file and then manipulate it accordingly.
I would then look at the possibility of caching this XmlDocument in memory so that you can access the file quickly.
What do you need the speed for? Do you have a performance bottleneck already or are you expecting one?

How is your XML file represented in code? Do you use the System.Xml classes? In that case you could use XmlDocument.AppendChild.

Try this out:
var doc = new XmlDocument();
doc.LoadXml("<Errors><error>This is my first error</error></Errors>");
XmlNode root = doc.DocumentElement;
//Create a new node.
XmlElement elem = doc.CreateElement("error");
elem.InnerText = "This is my error";
//Add the node to the document.
if (root != null) root.AppendChild(elem);
doc.Save(Console.Out);
Console.ReadLine();

Here's how to do it in C, .NET should be similar.
The game is to simply jump to the end of the file, skip back over the closing tag, append the new error line, and write a new closing tag.
#include <stdio.h>
#include <string.h>
#include <errno.h>
int main(int argc, char** argv) {
FILE *f;
// Open the file
f = fopen("log.xml", "r+");
// Small buffer to determine length of \n (1 on Unix, 2 on PC)
// You could always simply hard code this if you don't plan on
// porting to Unix.
char nlbuf[10];
sprintf(nlbuf, "\n");
// How long is our end tag?
long offset = strlen("</Errors>");
// Add in an \n char.
offset += strlen(nlbuf);
// Seek to the END OF FILE, and then GO BACK the end tag and newline
// so we use a NEGATIVE offset.
fseek(f, offset * -1, SEEK_END);
// Print out your new error line
fprintf(f, "<Error>New error line</Error>\n");
// Print out new ending tag.
fprintf(f, "</Errors>\n");
// Close and you're done
fclose(f);
}

The quickest method is likely to be reading in the file using an XmlReader and simply replicating each read node to a new stream using XmlWriter. When you encounter the closing </Errors> tag, you just need to output your additional <Error> element before continuing the 'read and duplicate' cycle. This way is inevitably going to be harder than reading the entire document into the DOM (XmlDocument class), but for large XML files it is much quicker. Admittedly, using StreamReader/StreamWriter would be somewhat faster still, but pretty horrible to work with in code.
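A minimal sketch of that 'read and duplicate' cycle might look like the following. It assumes the new element goes in as the last child of the root, in the root's namespace, and that the result is written to a separate file. The class and method names are invented for the example.

```csharp
using System.Xml;

public static class ErrorLogAppender
{
    // Copies sourcePath to targetPath node by node and adds one more
    // <Error> element just before the root element is closed.
    public static void Append(string sourcePath, string targetPath, string message)
    {
        using (XmlReader reader = XmlReader.Create(sourcePath))
        using (XmlWriter writer = XmlWriter.Create(targetPath))
        {
            reader.MoveToContent();                       // position on the root element
            writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
            writer.WriteAttributes(reader, true);
            if (!reader.IsEmptyElement)
            {
                reader.Read();                            // step inside the root
                // WriteNode copies the current node (and its subtree) and advances the reader.
                while (reader.NodeType != XmlNodeType.EndElement && !reader.EOF)
                {
                    writer.WriteNode(reader, true);
                }
            }
            // The reader now sits on the root's end tag: add the new element, then close the root.
            writer.WriteElementString("Error", reader.NamespaceURI, message);
            writer.WriteEndElement();
        }
    }
}
```

Because the output goes to a second file, the caller would still have to swap it in for the original afterwards. Whether this beats the seek-and-overwrite approaches for a 10 MB file would need measuring.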

Using string-based techniques (like seeking to the end of the file and then moving backwards the length of the closing tag) is vulnerable to unexpected but perfectly legal variations in document structure.
The document could end with any amount of whitespace, to pick the likeliest problem you'll encounter. It could also end with any number of comments or processing instructions. And what happens if the top-level element isn't named Errors?
And here's a situation that using string manipulation fails utterly to detect:
<Error xmlns="not_your_namespace">
...
</Error>
If you use an XmlReader to process the XML, while it may not be as fast as seeking to EOF, it will also allow you to handle all of these possible exception conditions.

I attempted to use the code other answers had suggested, but ran into an issue where calling .Length on my strings sometimes did not give the same value as the number of bytes in the string, so I was inconsistently losing characters. I modified it to get the byte count instead.
var endTag = "</Errors>";
var nodeText = GetNodeText();
using (FileStream fileStream = File.Open("my.log", FileMode.Open, FileAccess.ReadWrite))
{
    // Seek by the encoded byte length of the end tag, not its character count.
    fileStream.Seek(-Encoding.UTF8.GetByteCount(endTag), SeekOrigin.End);
    byte[] nodeBytes = Encoding.UTF8.GetBytes(nodeText);
    byte[] endTagBytes = Encoding.UTF8.GetBytes(endTag);
    fileStream.Write(nodeBytes, 0, nodeBytes.Length);
    fileStream.Write(endTagBytes, 0, endTagBytes.Length);
}
