Merging huge (2GB) XMLs in memory (without any memory exceptions) - c#

I would like C# code that appends two XML strings as efficiently as possible. Both have the same schema. I have tried StreamReader/StreamWriter, File.WriteAllText, and FileStream.
The problem I see is that the process uses more than 98% of physical memory and throws an out-of-memory exception.
Is there a way to merge them without getting memory exceptions? Time is not a concern for me.
If keeping the result in memory is going to be a problem, what would be better? Saving it to the file system?
Further Details:
Here is my simple program, to provide better detail:
static void Main(string[] args)
{
    Program p = new Program();
    XmlDocument x1 = new XmlDocument();
    XmlDocument x2 = new XmlDocument();
    x1.Load("C:\\XMLFiles\\1.xml");
    x2.Load("C:\\XMLFiles\\2.xml");
    List<string> files = new List<string>();
    files.Add("C:\\XMLFiles\\1.xml");
    files.Add("C:\\XMLFiles\\2.xml");
    p.ConsolidateFiles(files, "C:\\XMLFiles\\Result.xml");
    p.MergeFiles("C:\\XMLFiles\\Result.xml", x1.OuterXml, x2.OuterXml, "<Data>", "</Data>");
    Console.ReadLine();
}

public void ConsolidateFiles(List<String> files, string outputFile)
{
    var output = new StreamWriter(File.Open(outputFile, FileMode.Create));
    output.WriteLine("<Data>");
    foreach (var file in files)
    {
        var input = new StreamReader(File.Open(file, FileMode.Open));
        string line;
        while (!input.EndOfStream)
        {
            line = input.ReadLine();
            if (!line.Contains("<Data>") &&
                !line.Contains("</Data>"))
            {
                output.Write(line);
            }
        }
    }
    output.WriteLine("</Data>");
}

public void MergeFiles(string outputPath, string xmlState, string xmlFederal, string prefix, string suffix)
{
    File.WriteAllText(outputPath, prefix);
    File.AppendAllText(outputPath, xmlState);
    File.AppendAllText(outputPath, xmlFederal);
    File.AppendAllText(outputPath, suffix);
}
XML Sample:
<Data> </Data> is appended at the beginning and end.
XML 1: <Sections> <Section></Section> </Sections>
XML 2: <Sections> <Section></Section> </Sections>
Merged: <Data> <Sections> <Section></Section> </Sections> <Sections> <Section></Section> </Sections> </Data>

Try this: a stream-based approach that avoids loading all of the XML into memory at once.
static void Main(string[] args)
{
    List<string> files = new List<string>();
    files.Add("C:\\XMLFiles\\1.xml");
    files.Add("C:\\XMLFiles\\2.xml");
    ConsolidateFiles(files, "C:\\XMLFiles\\Result.xml");
    Console.ReadLine();
}

private static void ConsolidateFiles(List<String> files, string outputFile)
{
    using (var output = new StreamWriter(outputFile))
    {
        output.WriteLine("<Data>");
        foreach (var file in files)
        {
            // StreamReader has no (string, FileMode) constructor; open the file first
            using (var input = new StreamReader(File.Open(file, FileMode.Open)))
            {
                while (!input.EndOfStream)
                {
                    string line = input.ReadLine();
                    if (!line.Contains("<Data>") &&
                        !line.Contains("</Data>"))
                    {
                        output.Write(line);
                    }
                }
            }
        }
        output.WriteLine("</Data>");
    }
}
An even better approach is to use XmlReader (http://msdn.microsoft.com/en-us/library/system.xml.xmlreader(v=vs.90).aspx). This gives you a streaming reader designed specifically for XML, rather than StreamReader, which is for reading general text.

Take a look here. The answer given by Teoman Soygul seems to be what you're looking for.

This is untested, but I would do something along these lines using XmlTextReader and XmlTextWriter. You do not want to read all of the XML text into memory or store it in a string, and you do not want to use XElement/XDocument/etc. anywhere in the middle.
using (var writer = new XmlTextWriter("ResultFile.xml", Encoding.UTF8))
{
    writer.WriteStartDocument();
    writer.WriteStartElement("Data");
    using (var reader = new XmlTextReader("XmlFile1.xml"))
    {
        reader.MoveToContent();          // skip past the XML declaration to the root element
        writer.WriteNode(reader, true);  // copy the whole root element to the output
    }
    using (var reader = new XmlTextReader("XmlFile2.xml"))
    {
        reader.MoveToContent();
        writer.WriteNode(reader, true);
    }
    writer.WriteEndElement(); // </Data>
}
Again, no guarantees that this exact code works as-is, but I think it is the idea you're looking for: stream data from File1 and write it directly out to the result file, then stream data from File2 and write it out. At no point should a full XML file be in memory.

If you run on 64-bit, try this: go to your project properties -> Build tab -> Platform target: change "Any CPU" to "x64".
This solved my problem when loading huge XML files into memory.

You have to go to the file system, unless you have lots of RAM.
One simple approach:
File.WriteAllText("output.xml", "<Data>");
File.AppendAllText("output.xml", File.ReadAllText("xml1.xml"));
File.AppendAllText("output.xml", File.ReadAllText("xml2.xml"));
File.AppendAllText("output.xml", "</Data>");
another:
var fNames = new[] { "xml1.xml", "xml2.xml" };
string line;
using (var writer = new StreamWriter("output.xml"))
{
    writer.WriteLine("<Data>");
    foreach (var fName in fNames)
    {
        using (var file = new System.IO.StreamReader(fName))
        {
            while ((line = file.ReadLine()) != null)
            {
                writer.WriteLine(line);
            }
        }
    }
    writer.WriteLine("</Data>");
}
All of this is based on the premise that there is no XML declaration or root tag inside xml1.xml and xml2.xml.
If there is, just add code to omit them.
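A minimal sketch of that omission step (untested; the method and parameter names are illustrative, not from the original code). It copies each input file line by line, dropping the XML declaration and stripping the wrapper tags so only the inner content lands between <Data> and </Data>:

```csharp
// Sketch: append one input file to the output, skipping the XML
// declaration and stripping the given wrapper tags. Assumes the
// wrapper tags appear on their own or can be safely removed per line.
private static void AppendWithoutWrapper(StreamWriter output, string fileName,
                                         string rootOpen, string rootClose)
{
    using (var input = new StreamReader(fileName))
    {
        string line;
        while ((line = input.ReadLine()) != null)
        {
            if (line.TrimStart().StartsWith("<?xml"))
                continue; // drop the declaration
            line = line.Replace(rootOpen, "").Replace(rootClose, "");
            if (line.Length > 0)
                output.WriteLine(line);
        }
    }
}
```

Called once per file between writing "<Data>" and "</Data>", this keeps memory usage flat regardless of file size.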

Related

Xml gets corrupted each time I append a node

I have an XML file like this:
<?xml version="1.0"?>
<hashnotes>
<hashtags>
<hashtag>#birthday</hashtag>
<hashtag>#meeting</hashtag>
<hashtag>#anniversary</hashtag>
</hashtags>
<lastid>0</lastid>
<Settings>
<Font>Arial</Font>
<HashtagColor>red</HashtagColor>
<passwordset>0</passwordset>
<password></password>
</Settings>
</hashnotes>
I then call a function to add a node to the XML. The function is:
public static void CreateNoteNodeInXDocument(XDocument argXmlDoc, string argNoteText)
{
    string lastId = ((Convert.ToInt32(argXmlDoc.Root.Element("lastid").Value)) + 1).ToString();
    string date = DateTime.Now.ToString("MM/dd/yyyy");
    argXmlDoc.Element("hashnotes").Add(new XElement("Note",
        new XAttribute("ID", lastId),
        new XAttribute("Date", date),
        new XElement("Text", argNoteText)));
    //argXmlDoc.Root.Note.Add new XElement("Text", argNoteText)
    List<string> hashtagList = Utilities.GetHashtagsFromText(argNoteText);
    XElement reqNoteElement = (from xml2 in argXmlDoc.Descendants("Note")
                               where xml2.Attribute("ID").Value == lastId
                               select xml2).FirstOrDefault();
    if (reqNoteElement != null)
    {
        foreach (string hashTag in hashtagList)
        {
            reqNoteElement.Add(new XElement("hashtag", hashTag));
        }
    }
    argXmlDoc.Root.Element("lastid").Value = lastId;
}
After this I save the XML.
The next time I try to load the XML, it fails with an exception:
System.Xml.XmlException: Unexpected XML declaration. The XML declaration must be the first node in the document, and no white space characters are allowed to appear before it.
Here is the code to load the XML:
private static XDocument hashNotesXDocument;
private static Stream hashNotesStream;
StorageFile hashNoteXml = await InstallationFolder.GetFileAsync("hashnotes.xml");
hashNotesStream = await hashNoteXml.OpenStreamForWriteAsync();
hashNotesXDocument = XDocument.Load(hashNotesStream);
and I save it using:
hashNotesXDocument.Save(hashNotesStream);
You don't show all of your code, but it looks like you open the XML file, read the XML from it into an XDocument, edit the XDocument in memory, then write back to the still-open stream. Since the stream is still open, it is positioned at the end of the file, so the new XML gets appended after the old content.
Suggest eliminating hashNotesXDocument and hashNotesStream as static variables, and instead open and read the file, modify the XDocument, then open and write the file using the pattern shown here.
I'm working only on desktop code (using an older version of .Net) so I can't test this, but something like the following should work:
static async Task LoadUpdateAndSaveXml(Action<XDocument> editor)
{
    XDocument doc;
    var xmlFile = await InstallationFolder.GetFileAsync("hashnotes.xml");
    using (var reader = new StreamReader(await xmlFile.OpenStreamForReadAsync()))
    {
        doc = XDocument.Load(reader);
    }
    if (doc != null)
    {
        editor(doc);
        using (var stream = await xmlFile.OpenStreamForWriteAsync())
        {
            // Truncate - https://stackoverflow.com/questions/13454584/writing-a-shorter-stream-to-a-storagefile
            // (CanSeek/Length/SetLength live on the Stream, not on a StreamWriter)
            if (stream.CanSeek && stream.Length > 0)
                stream.SetLength(0);
            doc.Save(stream);
        }
    }
}
Also, be sure to create the file before using it.

What is the most efficient way to take XML from API and store it locally?

I am trying to find the fastest way to read XML from the merriam webster dictionary, and store it to a local file for later use. Below, I try to implement a module which does a few things:
Read 2000 words from a local directory
Look up each of the words in the merriam dictionary using the API
Store the definition(s) in a local XML for later use.
I'm not sure if an XML file is the best way to store this data, but it seemed like the simplest thing to do. At first, I thought I would do it in separate steps (1. look up each word and store the words and definitions in a data structure; 2. dump all the data into XML). However, that poses a problem, because it is too much to hold in memory at once.
So, in this scenario, I try to speed things up by looking up each word and then saving it to the XML one by one. This, however, is also slow: it takes around 10 minutes per 500-600 words.
public void load_module() // stores words/definitions into an XML file
{
    // 1. Pick up word from text file  2. Look up the word's definition  3. Store in XML
    string workdirect = Directory.GetCurrentDirectory();
    workdirect = workdirect.Substring(0, workdirect.LastIndexOf("bin"));
    workdirect += "words1.txt";
    using (StreamReader read = new StreamReader(workdirect)) // 1. Pick up word from text file
    {
        while (!read.EndOfStream)
        {
            string line = read.ReadLine();
            var definitions = load(line.ToLower()); // 2. Retrieve the word's definitions
            store_xml(line, definitions);
            wordlist.Add(line);
        }
    }
}

public List<string> load(string word)
{
    XmlDocument doc = new XmlDocument();
    List<string> definitions = new List<string>();
    XmlNodeList node = null;
    doc.Load("http://www.dictionaryapi.com/api/v1/references/collegiate/xml/" + word + "?key=*****************"); // Asterisks hide the actual API key
    if (doc.SelectSingleNode("entry_list").SelectSingleNode("entry").SelectSingleNode("def") == null)
    {
        return definitions;
    }
    node = doc.SelectSingleNode("entry_list").SelectSingleNode("entry").SelectSingleNode("def").SelectNodes("dt");
    // TODO: handle definitions when there is no "def" node in the first "entry" of "entry_list"
    foreach (XmlNode item in node)
    {
        definitions.Add(item.InnerXml.ToString().ToLower());
    }
    return definitions;
}

public void store_xml(string word, List<string> definitions)
{
    string local = Directory.GetCurrentDirectory();
    string name = "dictionary_word.xml";
    local = local.Substring(0, local.LastIndexOf("bin"));
    bool exists = File.Exists(local + name);
    if (exists)
    {
        XmlDocument doc = new XmlDocument();
        doc.Load(local + name);
        XmlElement wordindoc = doc.CreateElement("Word");
        wordindoc.SetAttribute("xmlns", word);
        XmlElement defs = doc.CreateElement("Definitions");
        foreach (var item in definitions)
        {
            XmlElement def = doc.CreateElement("Definition");
            def.InnerText = item;
            defs.AppendChild(def);
        }
        wordindoc.AppendChild(defs);
        doc.DocumentElement.AppendChild(wordindoc);
        doc.Save(local + name);
    }
    else
    {
        using (XmlWriter writer = XmlWriter.Create(local + name))
        {
            writer.WriteStartDocument();
            writer.WriteStartElement("Dictionary");
            writer.WriteStartElement("Word", word);
            writer.WriteStartElement("Definitions");
            foreach (var def in definitions)
            {
                writer.WriteElementString("Definition", def);
            }
            writer.WriteEndElement();
            writer.WriteEndElement();
            writer.WriteEndElement();
            writer.WriteEndDocument();
        }
    }
}
When handling large amounts of data that need to be exported to XML, I would normally keep the data in memory as a collection of custom objects rather than as an XmlDocument:
public class Definition
{
    public string Word { get; set; }
    public string Text { get; set; } // can't be named "Definition": a member can't share its enclosing class's name
}
I would then use XmlWriter to write the collection out to the XML file:
XmlWriterSettings settings = new XmlWriterSettings();
settings.Indent = true;
settings.IndentChars = ("    ");
settings.Encoding = Encoding.UTF8;
using (XmlWriter writer = XmlWriter.Create(@"C:\output\output.xml", settings))
{
    writer.WriteStartDocument();
    // TODO - use XmlWriter functions to write out each word and definition
    writer.Flush();
}
If you are still short on memory, you might be able to write out the XML in batches (e.g. every 500 definitions).
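The batching idea can be sketched like this (illustrative only; it assumes the Definition class above, with an IEnumerable<Definition> named definitions, and the settings object from the previous snippet):

```csharp
// Sketch: stream entries out with XmlWriter, flushing every batchSize
// entries so buffered output is pushed to disk and memory stays flat.
const int batchSize = 500;
int count = 0;
using (XmlWriter writer = XmlWriter.Create(@"C:\output\output.xml", settings))
{
    writer.WriteStartDocument();
    writer.WriteStartElement("Dictionary");
    foreach (var def in definitions)
    {
        writer.WriteStartElement("Word");
        writer.WriteAttributeString("name", def.Word);
        writer.WriteElementString("Definition", def.Text);
        writer.WriteEndElement();
        if (++count % batchSize == 0)
            writer.Flush(); // write the buffered batch out to the file
    }
    writer.WriteEndElement();
    writer.WriteEndDocument();
}
```

Because XmlWriter only buffers a small window of output, this scales to any number of definitions; the Flush() interval just controls how often data hits the disk.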
I found the Microsoft article on Improving XML Performance a very useful reference, particularly the section on Design Considerations.

Manipulating Word 2007 Document XML in C#

I am trying to manipulate the XML of a Word 2007 document in C#. I have managed to find and manipulate the node that I want but now I can't seem to figure out how to save it back. Here is what I am trying:
// Open the document from memoryStream
Package pkgFile = Package.Open(memoryStream, FileMode.Open, FileAccess.ReadWrite);
PackageRelationshipCollection pkgrcOfficeDocument = pkgFile.GetRelationshipsByType(strRelRoot);
foreach (PackageRelationship pkgr in pkgrcOfficeDocument)
{
    if (pkgr.SourceUri.OriginalString == "/")
    {
        Uri uriData = new Uri("/word/document.xml", UriKind.Relative);
        PackagePart pkgprtData = pkgFile.GetPart(uriData);
        XmlDocument doc = new XmlDocument();
        doc.Load(pkgprtData.GetStream());
        NameTable nt = new NameTable();
        XmlNamespaceManager nsManager = new XmlNamespaceManager(nt);
        nsManager.AddNamespace("w", nsUri);
        XmlNodeList nodes = doc.SelectNodes("//w:body/w:p/w:r/w:t", nsManager);
        foreach (XmlNode node in nodes)
        {
            if (node.InnerText == "{{TextToChange}}")
            {
                node.InnerText = "success";
            }
        }
        if (pkgFile.PartExists(uriData))
        {
            // Delete template "/customXML/item1.xml" part
            pkgFile.DeletePart(uriData);
        }
        PackagePart newPkgprtData = pkgFile.CreatePart(uriData, "application/xml");
        StreamWriter partWrtr = new StreamWriter(newPkgprtData.GetStream(FileMode.Create, FileAccess.Write));
        doc.Save(partWrtr);
        partWrtr.Close();
    }
}
pkgFile.Close();
I get the error 'Memory stream is not expandable'. Any ideas?
I would recommend that you use the Open XML SDK instead of hacking the format yourself.
Using OpenXML SDK 2.0, I do this:
public void SearchAndReplace(Dictionary<string, string> tokens)
{
    using (WordprocessingDocument doc = WordprocessingDocument.Open(_filename, true))
        ProcessDocument(doc, tokens);
}

private string GetPartAsString(OpenXmlPart part)
{
    string text = String.Empty;
    using (StreamReader sr = new StreamReader(part.GetStream()))
    {
        text = sr.ReadToEnd();
    }
    return text;
}

private void SavePart(OpenXmlPart part, string text)
{
    using (StreamWriter sw = new StreamWriter(part.GetStream(FileMode.Create)))
    {
        sw.Write(text);
    }
}

private void ProcessDocument(WordprocessingDocument doc, Dictionary<string, string> tokenDict)
{
    ProcessPart(doc.MainDocumentPart, tokenDict);
    foreach (var part in doc.MainDocumentPart.HeaderParts)
    {
        ProcessPart(part, tokenDict);
    }
    foreach (var part in doc.MainDocumentPart.FooterParts)
    {
        ProcessPart(part, tokenDict);
    }
}

private void ProcessPart(OpenXmlPart part, Dictionary<string, string> tokenDict)
{
    string docText = GetPartAsString(part);
    foreach (var keyval in tokenDict)
    {
        Regex expr = new Regex(_starttag + keyval.Key + _endtag);
        docText = expr.Replace(docText, keyval.Value);
    }
    SavePart(part, docText);
}
From this you could write a GetPartAsXmlDocument, do what you want with it, and then stream it back with SavePart(part, xmlString).
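A sketch of what that XmlDocument-based pair could look like (untested; it simply mirrors the string helpers above, reading the part stream into an XmlDocument and writing it back):

```csharp
// Sketch: load an OpenXmlPart into an XmlDocument for DOM-style edits...
private XmlDocument GetPartAsXmlDocument(OpenXmlPart part)
{
    var doc = new XmlDocument();
    using (var stream = part.GetStream())
    {
        doc.Load(stream);
    }
    return doc;
}

// ...and write the edited document back, truncating the old content.
private void SavePart(OpenXmlPart part, XmlDocument doc)
{
    using (var stream = part.GetStream(FileMode.Create))
    {
        doc.Save(stream);
    }
}
```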
Hope this helps!
You should use the OpenXML SDK to work on docx files and not write your own wrapper.
Getting Started with the Open XML SDK 2.0 for Microsoft Office
Introducing the Office (2007) Open XML File Formats
How to: Manipulate Office Open XML Formats Documents
Manipulate Docx with C# without Microsoft Word installed with OpenXML SDK
The problem appears to be doc.Save(partWrtr): partWrtr is built on newPkgprtData, which comes from pkgFile, which was loaded from a memory stream. Because you loaded from a memory stream, the document is saved back into that same memory stream, and that leads to the error you are seeing.
Instead of saving it to the memory stream try saving it to a new file or to a new memory stream.
The short and simple answer to the 'Memory stream is not expandable' issue is: do not open the document from a MemoryStream. So in that respect the earlier answer is correct; simply open a file instead.
In my experience, opening from a MemoryStream and editing the document easily leads to 'Memory stream is not expandable'. I suppose the message appears when an edit requires the memory stream to expand. I have found that I can make some edits, but not anything that adds to the size. So, for example, deleting a custom XML part is OK, but adding one with some data is not.
So if you actually need to open from a memory stream, you must figure out how to open an expandable MemoryStream. I have a need for this and hope to find a solution.
Stein-Tore Erdal
PS: I just noticed the answer from "Jan 26 '11 at 15:18". I don't think that is the answer in all situations. I get the error when trying this:
var ms = new MemoryStream(bytes);
using (WordprocessingDocument wd = WordprocessingDocument.Open(ms, true))
{
    ...
    using (MemoryStream msData = new MemoryStream())
    {
        xdoc.Save(msData);
        msData.Position = 0;
        ourCxp.FeedData(msData); // Memory stream is not expandable.

File locks when using File.Move in C#... how can I stop or fix this?

Code:
String tempFile = Path.GetTempFileName(), read = "";
TextReader pending = new StreamReader("c:\\pending.txt");
TextWriter temp = new StreamWriter(tempFile);
read = pending.ReadLine();
while ((read = pending.ReadLine()) != null)
{
    temp.WriteLine(read);
}
pending.Close();
temp.Close();
File.Delete("c:\\pending.txt");
File.Move(tempFile, "c:\\pending.txt");
The pending.txt file is created when the program starts if it doesn't exist. This code deletes the first line of the file. When I debug the code, I notice that the
File.Move(tempFile, "c:\\pending.txt");
locks the file and I cannot write to it anymore.
You should close your StreamReader and StreamWriter in using statements, like this:
String tempFile = Path.GetTempFileName(), read = "";
using (TextReader pending = new StreamReader("c:\\pending.txt"))
using (TextWriter temp = new StreamWriter(tempFile))
{
    read = pending.ReadLine();
    while ((read = pending.ReadLine()) != null)
    {
        temp.WriteLine(read);
    }
}
File.Delete(@"c:\pending.txt");
File.Move(tempFile, @"c:\pending.txt");
I had a similar situation with XML results files produced by the xUnit console runner. I'm adding it as an answer here in case it helps others find the cause and solution when the reader is an XmlTextReader, which is built on top of Stream and TextReader. It, too, can place locks on the underlying file that make subsequent Move and Delete operations fail if the underlying stream and reader are not closed and disposed of immediately when the reads are done.
public void ReadResultsXmlFile(string testResultsXmlFile)
{
    MyXmlTextReader = new XmlTextReader(testResultsXmlFile);
    testResultXmlDocument = new XmlDocument();
    testResultXmlDocument.Load(MyXmlTextReader);
    XmlNode xnAssembliesHeader = testResultXmlDocument.SelectSingleNode("/assemblies");
    XmlNodeList xnAssemblyList = testResultXmlDocument.SelectNodes("/assemblies/assembly");
    foreach (XmlNode assembly in xnAssemblyList)
    {
        XmlNodeList xnTestList = testResultXmlDocument.SelectNodes(
            "/assemblies/assembly/collection/test");
        foreach (XmlNode test in xnTestList)
        {
            TestName = test.Attributes.GetNamedItem("name").Value;
            TestDuration = test.Attributes.GetNamedItem("time").Value;
            PassOrFail = test.Attributes.GetNamedItem("result").Value;
        }
    }
}
Of course, it's obvious in hindsight that I failed to close the XmlTextReader that includes an underlying StreamReader, and that this was leaving locks on the XML results files.
The fixed code looks like this:
public void ReadResultsXmlFile(string testResultsXmlFile)
{
    using (MyXmlTextReader = new XmlTextReader(testResultsXmlFile))
    {
        testResultXmlDocument = new XmlDocument();
        testResultXmlDocument.Load(MyXmlTextReader);
        XmlNode xnAssembliesHeader = testResultXmlDocument.SelectSingleNode("/assemblies");
        XmlNodeList xnAssemblyList = testResultXmlDocument.SelectNodes("/assemblies/assembly");
        foreach (XmlNode assembly in xnAssemblyList)
        {
            XmlNodeList xnTestList = testResultXmlDocument.SelectNodes(
                "/assemblies/assembly/collection/test");
            foreach (XmlNode test in xnTestList)
            {
                TestName = test.Attributes.GetNamedItem("name").Value;
                TestDuration = test.Attributes.GetNamedItem("time").Value;
                PassOrFail = test.Attributes.GetNamedItem("result").Value;
            }
        }
    }
}
... and the problem with locked files for subsequent Move and Delete operations went away. The key lines, of course, are:
using (MyXmlTextReader = new XmlTextReader(testResultsXmlFile))
{
    // Do stuff inside the "using" block
} // At the close of the "using" block, the reader and its underlying stream are disposed, releasing the lock
For what it's worth, the app is a test runner that runs Selenium-based automated web tests using the xUnit console runner, which is told via a command-line option to create XML results files. SpecFlow is also involved, on top of the xUnit test runner layer, but the xUnit results files are what is being read. After test execution, I wanted to move the XML results files into date-based archive folders, and that File.Move() operation was failing due to locks left on the results files by the version of the code without the using block.

BOM encoding for database storage

I'm using the following code to serialise an object:
public static string Serialise(IMessageSerializer messageSerializer, DelayMessage message)
{
    using (var stream = new MemoryStream())
    {
        messageSerializer.Serialize(new[] { message }, stream);
        return Encoding.UTF8.GetString(stream.ToArray());
    }
}
Unfortunately, when I save it to a database (using LINQ to SQL), then query the database, the string appears to start with a question mark:
?<z:anyType xmlns...
How do I get rid of that? When I try to de-serialise using the following:
public static DelayMessage Deserialise(IMessageSerializer messageSerializer, string data)
{
    using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(data)))
    {
        return (DelayMessage)messageSerializer.Deserialize(stream)[0];
    }
}
I get the following exception:
"Error in line 1 position 1. Expecting element 'anyType' from namespace 'http://schemas.microsoft.com/2003/10/Serialization/'.. Encountered 'Text' with name '', namespace ''."
The implementations of the messageSerializer use the DataContractSerializer as follows:
public void Serialize(IMessage[] messages, Stream stream)
{
    var xws = new XmlWriterSettings { ConformanceLevel = ConformanceLevel.Fragment };
    using (var xmlWriter = XmlWriter.Create(stream, xws))
    {
        var dcs = new DataContractSerializer(typeof(IMessage), knownTypes);
        foreach (var message in messages)
        {
            dcs.WriteObject(xmlWriter, message);
        }
    }
}

public IMessage[] Deserialize(Stream stream)
{
    var xrs = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
    using (var xmlReader = XmlReader.Create(stream, xrs))
    {
        var dcs = new DataContractSerializer(typeof(IMessage), knownTypes);
        var messages = new List<IMessage>();
        while (false == xmlReader.EOF)
        {
            var message = (IMessage)dcs.ReadObject(xmlReader);
            messages.Add(message);
        }
        return messages.ToArray();
    }
}
Unfortunately, when I save it to a database (using LINQ to SQL), then query the database, the string appears to start with a question mark:
?<z:anyType xmlns...
Your database is not set up to support Unicode characters. You write a string with a BOM in it; the database can't store the BOM, so it mangles it into a '?'. Then, when you come back to read the string as XML, the '?' is text content outside the root element and you get an error. (Only whitespace text is allowed outside the root element.)
Why is the BOM getting there? Because Microsoft's defaults drop BOMs all over the place, even when they're not needed (and with UTF-8 they never are). The solution is to make your own instance of UTF8Encoding instead of using the built-in Encoding.UTF8, and tell it you don't want a BOM:
Encoding utf8NoBom = new UTF8Encoding(false);
However, this is only really masking the real issue, which is the database configuration.
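In the code above, the BOM bytes are written by the XmlWriter (whose settings default to a BOM-emitting UTF-8 encoding), so the place to apply the fix is the XmlWriterSettings in the Serialize method. A sketch, assuming the Serialize method shown earlier:

```csharp
// Sketch: give the XmlWriter a UTF-8 encoding with no byte order mark,
// so no preamble bytes ever reach the stream (or the database).
var xws = new XmlWriterSettings
{
    ConformanceLevel = ConformanceLevel.Fragment,
    Encoding = new UTF8Encoding(false) // false = do not emit a BOM
};
using (var xmlWriter = XmlWriter.Create(stream, xws))
{
    // ... dcs.WriteObject(xmlWriter, message); as in the original code
}
```

With the BOM gone, Encoding.UTF8.GetString no longer decodes a stray U+FEFF at the start of the string, so nothing is left for the database to mangle.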
