I have a large XML file (approx. 10 MB) with the following simple structure:
<Errors>
<Error>.......</Error>
<Error>.......</Error>
<Error>.......</Error>
<Error>.......</Error>
<Error>.......</Error>
</Errors>
I need to add a new <Error> node at the end, just before the </Errors> tag. What is the fastest way to achieve this in .NET?
You need to use the XML inclusion technique.
Your error.xml (doesn't change, just a stub. Used by XML parsers to read):
<?xml version="1.0"?>
<!DOCTYPE Errors [
<!ENTITY logrows
SYSTEM "errorrows.txt">
]>
<Errors>
&logrows;
</Errors>
Your errorrows.txt file (changes, the xml parser doesn't understand it):
<Error>....</Error>
<Error>....</Error>
<Error>....</Error>
Then, to add an entry to errorrows.txt:
using (StreamWriter sw = File.AppendText("errorrows.txt"))
using (XmlTextWriter xtw = new XmlTextWriter(sw))
{
    xtw.WriteStartElement("Error");
    // ... write the error message here, e.g. xtw.WriteString(message);
    xtw.WriteEndElement();
}
Or you can even use .NET 3.5 XElement, and append the text to the StreamWriter:
using (StreamWriter sw = File.AppendText("errorrows.txt"))
{
    XElement element = new XElement("Error");
    // ... write the error message here
    sw.WriteLine(element.ToString());
}
See also Microsoft's article Efficient Techniques for Modifying Large XML Files
First, I would disqualify System.Xml.XmlDocument because it is a DOM which requires parsing and building the entire tree in memory before it can be appended to. This means your 10 MB of text will be more than 10 MB in memory. This means it is "memory intensive" and "time consuming".
Second, I would disqualify System.Xml.XmlReader because it has to parse the entire file before you reach the point where you can append. You would have to copy the XmlReader into an XmlWriter, since a reader cannot modify anything. That means rewriting the whole document before you can append to it.
The faster solution to XmlDocument and XmlReader would be string manipulation (which has its own memory issues):
string xml = @"<Errors><error />...<error /></Errors>";
int idx = xml.LastIndexOf("</Errors>");
xml = xml.Substring(0, idx) + "<error>new error</error></Errors>";
Chop off the end tag, add in the new error, and add the end tag back.
I suppose you could go crazy with this and truncate your file by 9 characters and append to it. Wouldn't have to read in the file and would let the OS optimize page loading (only would have to load in the last block or something).
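using (System.IO.FileStream fs = System.IO.File.Open("log.xml", System.IO.FileMode.Open, System.IO.FileAccess.ReadWrite))
{
    // Position the stream on the first byte of the closing tag
    fs.Seek(-"</Errors>".Length, System.IO.SeekOrigin.End);
    // FileStream.Write takes bytes, not a string
    byte[] bytes = System.Text.Encoding.UTF8.GetBytes("<error>new error</error></Errors>");
    fs.Write(bytes, 0, bytes.Length);
}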
That will hit a problem if your file is empty or contains only "<Errors></Errors>", both of which can easily be handled by checking the length.
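The length check and the seek can be combined in one helper. What follows is only a minimal sketch, not a definitive implementation: the names ErrorLog and AppendError are made up for illustration, and it assumes the file is UTF-8 (or another ASCII-compatible encoding) and that the closing tag sits within the last 256 bytes. It searches the tail for the tag instead of trusting a fixed offset, so trailing whitespace or an empty file is handled with an exception rather than a corrupted file.

```csharp
using System;
using System.IO;
using System.Text;

static class ErrorLog
{
    const string EndTag = "</Errors>";

    // Hypothetical helper: overwrites the closing tag with the new element plus a fresh closing tag.
    public static void AppendError(string path, string errorElementXml)
    {
        byte[] endTag = Encoding.UTF8.GetBytes(EndTag);
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.ReadWrite))
        {
            // Read only the tail of the file (assumption: the tag is in the last 256 bytes).
            int tailLength = (int)Math.Min(fs.Length, 256);
            byte[] tail = new byte[tailLength];
            fs.Seek(-tailLength, SeekOrigin.End);
            int read = 0;
            while (read < tailLength)
            {
                int n = fs.Read(tail, read, tailLength - read);
                if (n == 0) break;
                read += n;
            }

            // Covers the empty-file case and any file that does not end with the expected tag.
            int index = LastIndexOf(tail, read, endTag);
            if (index < 0)
                throw new InvalidDataException("Closing tag not found near the end of the file.");

            long tagPosition = fs.Length - tailLength + index;
            byte[] payload = Encoding.UTF8.GetBytes(errorElementXml + Environment.NewLine + EndTag);
            fs.Seek(tagPosition, SeekOrigin.Begin);
            fs.Write(payload, 0, payload.Length);
            fs.SetLength(fs.Position); // drop whatever followed the old closing tag
        }
    }

    // Byte-wise backwards search, so multi-byte characters cannot shift the offset.
    static int LastIndexOf(byte[] haystack, int count, byte[] needle)
    {
        for (int start = count - needle.Length; start >= 0; start--)
        {
            int i = 0;
            while (i < needle.Length && haystack[start + i] == needle[i]) i++;
            if (i == needle.Length) return start;
        }
        return -1;
    }
}
```

A call would look like ErrorLog.AppendError("log.xml", "<Error>new error</Error>"). The caller is still responsible for passing well-formed, properly escaped element text.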
The fastest way would probably be a direct file access.
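using (var stream = new FileStream("my.log", FileMode.Open, FileAccess.ReadWrite))
using (var file = new StreamWriter(stream))
{
    // File.AppendText cannot be used here: a stream opened in Append mode
    // throws an IOException when you seek to a position before the end of the file.
    stream.Seek(-"</Errors>".Length, SeekOrigin.End);
    file.Write(" <Error>New error message.</Error></Errors>");
}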
But you lose all the nice XML features and may easily corrupt the file.
I would use XmlDocument or XDocument to Load your file and then manipulate it accordingly.
I would then look at the possibility of caching this XmlDocument in memory so that you can access the file quickly.
What do you need the speed for? Do you have a performance bottleneck already or are you expecting one?
How is your XML file represented in code? Do you use the System.Xml classes? In that case you could use XmlDocument.AppendChild.
Try this out:
var doc = new XmlDocument();
doc.LoadXml("<Errors><error>This is my first error</error></Errors>");
XmlNode root = doc.DocumentElement;
//Create a new node.
XmlElement elem = doc.CreateElement("error");
elem.InnerText = "This is my error";
//Add the node to the document.
if (root != null) root.AppendChild(elem);
doc.Save(Console.Out);
Console.ReadLine();
Here's how to do it in C, .NET should be similar.
The game is to simply jump to the end of the file, skip back over the closing tag, append the new error line, and write a new closing tag.
#include <stdio.h>
#include <string.h>
#include <errno.h>
int main(int argc, char** argv) {
FILE *f;
// Open the file
f = fopen("log.xml", "r+");
// Bail out if the file is missing or not writable
if (f == NULL) { perror("log.xml"); return 1; }
// Small buffer to determine length of \n (1 on Unix, 2 on PC)
// You could always simply hard code this if you don't plan on
// porting to Unix.
char nlbuf[10];
sprintf(nlbuf, "\n");
// How long is our end tag?
long offset = strlen("</Errors>");
// Add in an \n char.
offset += strlen(nlbuf);
// Seek to the END OF FILE, and then GO BACK the end tag and newline
// so we use a NEGATIVE offset.
fseek(f, offset * -1, SEEK_END);
// Print out your new error line
fprintf(f, "<Error>New error line</Error>\n");
// Print out new ending tag.
fprintf(f, "</Errors>\n");
// Close and you're done
fclose(f);
}
The quickest method is likely to be reading in the file using an XmlReader and simply replicating each node read to a new stream using XmlWriter. When you encounter the closing </Errors> tag, you just need to output your additional <Error> element before continuing the 'read and duplicate' cycle. This way is inevitably going to be harder than reading the entire document into the DOM (XmlDocument class), but for large XML files, much quicker. Admittedly, using StreamReader/StreamWriter would be somewhat faster still, but pretty horrible to work with in code.
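The read-and-duplicate cycle could be sketched roughly as follows. This is an illustration under stated assumptions rather than the definitive approach: StreamingAppend, CopyAndAppend and AppendError are hypothetical names, the new element is written in no namespace, and anything before the root element (XML declaration, comments) is skipped instead of copied. The string-based wrapper exists only to make the example easy to try; for a real file you would create the reader and writer over file streams.

```csharp
using System.IO;
using System.Xml;

static class StreamingAppend
{
    // Copies the root element and its children, then writes one extra <Error> before closing the root.
    public static void CopyAndAppend(XmlReader reader, XmlWriter writer, string message)
    {
        reader.MoveToContent(); // advance to the root element
        bool rootIsEmpty = reader.IsEmptyElement;
        writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
        writer.WriteAttributes(reader, false);

        if (!rootIsEmpty)
        {
            reader.Read(); // move to the first child node
            // WriteNode copies the current node (with its subtree) and advances the reader.
            while (!reader.EOF && reader.NodeType != XmlNodeType.EndElement)
                writer.WriteNode(reader, false);
        }

        // The reader is now on the root's end tag (or the root was empty): add the new entry.
        writer.WriteElementString("Error", message);
        writer.WriteEndElement();
    }

    // Convenience wrapper for trying the technique on in-memory strings.
    public static string AppendError(string xml, string message)
    {
        var output = new StringWriter();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        using (XmlWriter writer = XmlWriter.Create(output, settings))
            CopyAndAppend(reader, writer, message);
        return output.ToString();
    }
}
```

Because the whole document is rewritten, the output should go to a new file that replaces the original once the copy has succeeded.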
Using string-based techniques (like seeking to the end of the file and then moving backwards the length of the closing tag) is vulnerable to unexpected but perfectly legal variations in document structure.
The document could end with any amount of whitespace, to pick the likeliest problem you'll encounter. It could also end with any number of comments or processing instructions. And what happens if the top-level element isn't named Errors?
And here's a situation that using string manipulation fails utterly to detect:
<Error xmlns="not_your_namespace">
...
</Error>
If you use an XmlReader to process the XML, while it may not be as fast as seeking to EOF, it will also allow you to handle all of these possible exception conditions.
I attempted to use code other answers had suggested but ran into an issue where sometimes calling .length on my strings was not the same as the number of bytes for the string so I was inconsistently losing characters. I modified it to get the byte count instead.
var endTag = "</Errors>";
var nodeText = GetNodeText();
using (FileStream file = File.Open("my.log", FileMode.Open, FileAccess.ReadWrite))
{
file.Seek(-Encoding.UTF8.GetByteCount(endTag), SeekOrigin.End);
file.Write(Encoding.UTF8.GetBytes(nodeText), 0, Encoding.UTF8.GetByteCount(nodeText));
file.Write(Encoding.UTF8.GetBytes(endTag), 0, Encoding.UTF8.GetByteCount(endTag));
}
Related
I am writing a network server in C# .NET 4.0. There is a network TCP/IP connection over which I can receive complete XML elements. They arrive regularly and I need to process them immediately. Each XML element is a complete XML document in itself, so it has an opening element, several sub-nodes and a closing element. There is no single root element for the entire stream. So when I open the connection, what I get is like this:
<status>
<x>123</x>
<y>456</y>
</status>
Then some time later it continues:
<status>
<x>234</x>
<y>567</y>
</status>
And so on. I need a way to read the complete XML string until a status element is complete. I don't want to do that with plain text reading methods because I don't know in what formatting the data arrives. I can in no way wait until the entire stream is finished, as is often described elsewhere. I have tried using the XmlReader class but its documentation is weird, the methods don't work out, the first element is lost and after sending the second element, an XmlException occurs because there are two root elements.
Try this:
var settings = new XmlReaderSettings
{
ConformanceLevel = ConformanceLevel.Fragment
};
using (var reader = XmlReader.Create(stream, settings))
{
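// MoveToContent returns XmlNodeType.None at the end of the stream, so the loop ends cleanly
while (reader.MoveToContent() == XmlNodeType.Element)
{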
var doc = XDocument.Load(reader.ReadSubtree());
Console.WriteLine("X={0}, Y={1}",
(int)doc.Root.Element("x"),
(int)doc.Root.Element("y"));
reader.ReadEndElement();
}
}
If you change the "conformance level" to "fragment", it might work with the XmlReader.
This is a (slightly modified) example from MSDN:
XmlReaderSettings settings = new XmlReaderSettings();
settings.ConformanceLevel = ConformanceLevel.Fragment;
XmlReader reader = XmlReader.Create(streamOfXmlFragments, settings);
You could use XElement.Load, which is new in .NET 3.5, is meant more for streaming XML element fragments, and also supports reading directly from a stream.
Have a look at System.Xml.Linq
I think that you may well still have to add some control logic so as to partition the messages you are receiving, but you may as well give it a go.
I'm not sure there's anything built-in that does that.
I'd open a string builder, fill it until I see a </status> tag, and then parse it using the ordinary XmlDocument.
Not substantially different from dtb's solution, but linqier
static IEnumerable<XDocument> GetDocs(Stream xmlStream)
{
var xmlSettings = new XmlReaderSettings() { ConformanceLevel = ConformanceLevel.Fragment };
using (var xmlReader = XmlReader.Create(xmlStream, xmlSettings))
{
var xmlPathNav = new XPathDocument(xmlReader).CreateNavigator();
foreach (var selectee in xmlPathNav.Select("/*").OfType<XPathNavigator>())
yield return XDocument.Load(selectee.ReadSubtree());
}
}
I ran into a similar problem in PowerShell, but the asker's question was in C#, so I've attempted to translate it (and verified that it works). Here is where I found the clue that got me over the last little bumps (". . .The way the XPathDocument does its magic is by creating a “transparent” root node, and holding the fragments from it. I say it’s transparent because your XPath queries can use the root node axis and still get properly resolved to the fragments. . .")
The fragments of XML I'm working with happen to be smallish. If you had bigger chunks, you'd probably want to look into XStreamingElement - it can add a lot of complexity but also greatly decrease memory usage when dealing with large volumes of XML.
I need to split an XML file (~400 MB) in two, so that a legacy app can process the file. At the moment it's throwing an exception when the file is over around 300 MB.
As I can't change the app which is doing the processing, I thought I could write a console app to split the file in two first. What's the best way of doing this? It needs to be automated so I can't use a text editor, and I'm using C#.
I suppose the considerations are:
writing a header to the new files after the split
finding a good place to split (not in middle of 'object')
closing off tags and file correctly in first file, opening tags correctly in second file
Any suggestions?
The "best" way is likely to be based on XmlReader and XmlWriter. Using these "streaming" APIs avoids needing to load the whole XML object model in memory (and with DOM –XmlDocument– that can need considerably more memory than the text data).
Using these APIs is harder than just loading the document: your implementation needs to track the context (eg. current node and ancestor list), but in this case that wouldn't be complex (just enough to open the elements to the current state when opening each output document).
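For a document whose records are all direct children of the root, the context tracking reduces to repeating the root start tag in each output. A rough sketch, assuming that flat shape and assuming you already know how many child elements belong in the first file; XmlSplitter, Split and elementsInFirst are names invented here, not part of any library:

```csharp
using System.Xml;

static class XmlSplitter
{
    // Streams the children of the root element into two outputs, repeating the root in both.
    public static void Split(XmlReader reader, XmlWriter first, XmlWriter second, int elementsInFirst)
    {
        reader.MoveToContent(); // advance to the root element
        bool rootIsEmpty = reader.IsEmptyElement;

        // Re-open the root (with its attributes) in each output document.
        foreach (XmlWriter writer in new[] { first, second })
        {
            writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
            writer.WriteAttributes(reader, false); // leaves the reader on the element
        }

        if (!rootIsEmpty)
        {
            reader.Read(); // move to the first child node
            int seen = 0;
            while (!reader.EOF && reader.NodeType != XmlNodeType.EndElement)
            {
                // Pick the output before counting, so element number elementsInFirst still goes to the first file.
                XmlWriter target = seen < elementsInFirst ? first : second;
                if (reader.NodeType == XmlNodeType.Element) seen++;
                target.WriteNode(reader, false); // copies the node and advances the reader
            }
        }

        // Close the root in both outputs.
        first.WriteEndElement();
        second.WriteEndElement();
    }
}
```

Only one child subtree is held by the reader at a time, so memory use stays flat regardless of file size. If the count of children is not known up front, a first pass with a plain XmlReader can count them, or the split point can be chosen by some other rule.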
You might want to consider making a full copy of the file and then deleting elements from each. You will have to decide at what level the deletions could occur.
It should then be fairly straightforward, from a count of how many elements have been deleted from FileA, to identify how many (and from what starting point) should be deleted from FileB.
Is that feasible for your circumstance?
I have put together the following to describe my thinking. It is not tested, but I would value the comments of the group. Downvote me if you want but I would prefer constructive criticism.
using System.Xml;
using System.Xml.Schema;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
SplitXML(args[0], args[1]);
}
private static void SplitXML(string fileNameA, string fileNameB)
{
int deleteCount;
XmlNodeList childNodes;
XmlReader reader;
XmlTextWriter writer;
XmlDocument doc;
// ------------- Process FileA
reader = XmlReader.Create(fileNameA);
doc = new XmlDocument();
doc.Load(reader);
childNodes = doc.DocumentElement.ChildNodes;
deleteCount = childNodes.Count / 2;
for (int i = 0; i < deleteCount; i++)
{
doc.DocumentElement.RemoveChild(childNodes.Item(0));
}
writer = new XmlTextWriter("FileC", null);
doc.Save(writer);
writer.Close(); // flush the output and release the file handle
// ------------- Process FileB
reader = XmlReader.Create(fileNameB);
doc = new XmlDocument();
doc.Load(reader);
childNodes = doc.DocumentElement.ChildNodes;
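// Keep the first deleteCount children. ChildNodes is a live list, so keep
// removing the node at index deleteCount until nothing is left after it.
while (childNodes.Count > deleteCount)
{
doc.DocumentElement.RemoveChild(childNodes.Item(deleteCount));
}
writer = new XmlTextWriter("FileD", null);
doc.Save(writer);
writer.Close(); // flush the output and release the file handle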
}
}
}
If it's pure C#, running it as a 64-bit process might solve the problem for no effort at all (assuming you have a 64-bit Windows at hand).
So this is the code snippet:
static void Main(string[] args)
{
Console.WriteLine("Memory mapped file reader started");
using (var file = MemoryMappedFile.OpenExisting("AIDA64_SensorValues"))
{
using (var reader = file.CreateViewAccessor())
{
var bytes = new byte[3388];
var encoding = Encoding.ASCII;
XmlDocument document = new XmlDocument();
document.LoadXml("<root>" + encoding.GetString(bytes) + "</root>");
XmlNode node = document.DocumentElement.SelectSingleNode("//value");
Console.WriteLine("node = " + node);
}
}
Console.WriteLine("Press any key to exit ...");
Console.ReadLine();
}
It reads info like this from shared memory (snippet):
<sys><id>SCPUCLK</id><label>CPU Clock</label><value>2930</value></sys><sys><id>SCPUMUL</id><label>CPU Multiplier</label><value>11.0</value></sys>
and reads the value in every <value></value>, then prints it.
But it's not working quite right: I get an invalid character "." exception when it runs.
So I tried changing "new byte[3388];". The whole length of the string without the <root></root> is 3388 (I printed it to a TXT file on the HDD to find that out), so I added 13 and got 3401 (because that's how long the root tags are, which I had to add to fix the multiple-root error).
But I still seem to get an error: "'.', hexadecimal value 0x00, is an invalid character. Line 1, position 7."
thanks
So a few thoughts:
Your error is caused by non-printing characters in your XML string, which cause the XML parsing in XmlDocument.LoadXml() to fail.
Looking at the MemoryMappedViewAccessor class that is returned (reader), you want to be checking the Capacity property of that, which will be the farthest out you can read. ReadByte() is okay, but if you know your data is ASCII, why not use ReadChar() and append them as you go to a StringBuilder object?
If the data itself has the invalid characters, there's literally no way to load this correctly during the read process without first somehow sanitizing these characters from the string. For a scenario like that, I would dump your buffer out to a file, then load that file with a tool like Notepad++, which has a "Show All Characters" function. This will enable you to see specifically what characters and where (if they are outside of the normal XML, this might indicate you still don't quite have the buffer handling correct).
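Note that the question's code allocates the byte array but never copies anything from the view into it, so the string handed to LoadXml consists entirely of 0x00 characters. A minimal sketch of reading up to Capacity and cutting the text at the first NUL byte might look like this. SharedMemoryXml, DecodeUpToNul and ReadSensorXml are hypothetical names, and the sketch assumes the shared memory holds ASCII text terminated or padded with NUL bytes (named memory-mapped files opened this way are a Windows feature):

```csharp
using System;
using System.IO.MemoryMappedFiles;
using System.Text;
using System.Xml;

static class SharedMemoryXml
{
    // Decodes the buffer as ASCII, stopping at the first NUL byte (assumed to mark the end of the data).
    public static string DecodeUpToNul(byte[] buffer)
    {
        int length = Array.IndexOf(buffer, (byte)0);
        if (length < 0) length = buffer.Length;
        return Encoding.ASCII.GetString(buffer, 0, length);
    }

    // Reads the whole view, trims the padding and wraps the fragments in a single root element.
    public static XmlDocument ReadSensorXml(string mapName)
    {
        using (var file = MemoryMappedFile.OpenExisting(mapName))
        using (var accessor = file.CreateViewAccessor())
        {
            // Capacity is the farthest position that can be read from this view.
            var buffer = new byte[accessor.Capacity];
            accessor.ReadArray(0, buffer, 0, buffer.Length);

            var document = new XmlDocument();
            document.LoadXml("<root>" + DecodeUpToNul(buffer) + "</root>");
            return document;
        }
    }
}
```

With a document loaded this way, document.SelectNodes("//value") returns all value elements, and each node's InnerText holds the reading.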
I have a malformed XML file. The root tag is not closed by a tag. The final tag is missing.
When I try to load my malformed XML file in C#
StreamReader sr = new StreamReader(path);
batchFile = XDocument.Load(sr); // Exception
I get an exception "Unexpected end of file has occurred. The following elements are not closed: batch. Line 54, position 1."
Is it possible to ignore the close tag or to force the loading? I noticed that all my XML tools (like XML Notepad) automatically fix or ignore the problem. I cannot fix the XML file. It comes from third-party software, and sometimes the file is correct.
You can't do it with XDocument because this class loads the whole document into memory and parses it completely.
But it is possible to process the document with XmlReader: it lets you read and process the complete document, and only at the end will you get the missing-tag exception.
I suggest using Tidy.NET to clean up messy input.
Tidy.NET has a nice API to get a list of problems (MessageCollection) in your 'XML', and you can use it to fix the text stream in memory. The simplest thing would be to fix one error at a time, though that will not perform too well with many errors. Otherwise, you might fix errors in reverse document order so that the offsets of messages stay valid while doing the fixes.
Here is an example to convert HTML input into XHTML:
Tidy tidy = new Tidy();
/* Set the options you want */
tidy.Options.DocType = DocType.Strict;
tidy.Options.DropFontTags = true;
tidy.Options.LogicalEmphasis = true;
tidy.Options.Xhtml = true;
tidy.Options.XmlOut = true;
tidy.Options.MakeClean = true;
tidy.Options.TidyMark = false;
/* Declare the parameters that are needed */
TidyMessageCollection tmc = new TidyMessageCollection();
MemoryStream input = new MemoryStream();
MemoryStream output = new MemoryStream();
byte[] byteArray = Encoding.UTF8.GetBytes("Put your HTML here...");
input.Write(byteArray, 0 , byteArray.Length);
input.Position = 0;
tidy.Parse(input, output, tmc);
string result = Encoding.UTF8.GetString(output.ToArray());
What you could do is add the closing tag to the xml in memory and then load it.
So after loading the xml into the streamreader, manipulate the data before you do the xml load
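A minimal sketch of that idea, assuming the only defect is the missing end tag of the root element and that you know the root's name (batch in the question). LenientLoader and its Load method are names invented for this example:

```csharp
using System.Xml;
using System.Xml.Linq;

static class LenientLoader
{
    // Parses the text as-is; if that fails, retries once with the root end tag appended.
    public static XDocument Load(string text, string rootName)
    {
        try
        {
            return XDocument.Parse(text);
        }
        catch (XmlException)
        {
            // Assumption: the failure was caused by the missing closing tag of the root.
            // Any other defect makes the second Parse throw as well.
            return XDocument.Parse(text.TrimEnd() + "</" + rootName + ">");
        }
    }
}
```

Usage would be along the lines of LenientLoader.Load(File.ReadAllText(path), "batch"). Files that are already correct take the first branch and are loaded unchanged.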
Environment: asp.net c# openxml
Ok, so I've been reading a ton of snippets and trying to recreate the wheel, but I'm hoping that someone can help me get to my destination faster. I have multiple documents that I need to merge together... check... I'm able to do that with the Open XML SDK. Birds are singing, sun is shining so far. Now that I have the document the way I want it, I need to search and replace text and/or content controls.
I've tried using my own text - {replace this} but when I look at the xml (rename docx to zip and view the file), the { is nowhere near the text. So I either need to know how to protect that within the document so they don't diverge or I need to find another way to search and replace.
I'm able to search/replace if it is an xml file, but then I'm back to not being able to combine the documents easily.
Code below... and as I mentioned... document merge works fine... just need to replace stuff.
* Update * changed my replace call to go after the tag instead of regex. I have the right info now, but the .Replace call doesn't seem to want to work. Last four lines are for validation that I was seeing the right tag contents. I simply want to replace those contents now.
protected void exeProcessTheDoc(object sender, EventArgs e)
{
string doc1 = Server.MapPath("~/Templates/doc1.docx");
string doc2 = Server.MapPath("~/Templates/doc2.docx");
string final_doc = Server.MapPath("~/Templates/extFinal.docx");
File.Delete(final_doc);
File.Copy(doc1, final_doc);
using (WordprocessingDocument myDoc = WordprocessingDocument.Open(final_doc, true))
{
string altChunkId = "AltChunkId2";
MainDocumentPart mainPart = myDoc.MainDocumentPart;
AlternativeFormatImportPart chunk = mainPart.AddAlternativeFormatImportPart(
AlternativeFormatImportPartType.WordprocessingML, altChunkId);
using (FileStream fileStream = File.Open(doc2, FileMode.Open))
chunk.FeedData(fileStream);
AltChunk altChunk = new AltChunk();
altChunk.Id = altChunkId;
mainPart.Document.Body.InsertAfter(altChunk, mainPart.Document.Body.Elements<Paragraph>().Last());
mainPart.Document.Save();
}
exeSearchReplace(final_doc);
}
public static void GetPropertyFromDocument(string document, string outdoc)
{
XmlDocument xmlProperties = new XmlDocument();
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(document, false))
{
ExtendedFilePropertiesPart appPart = wordDoc.ExtendedFilePropertiesPart;
xmlProperties.Load(appPart.GetStream());
}
XmlNodeList chars = xmlProperties.GetElementsByTagName("Company");
chars.Item(0).InnerText.Replace("{ClientName}", "Penn Inc.");
StreamWriter sw;
sw = File.CreateText(outdoc);
sw.WriteLine(chars.Item(0).InnerText);
sw.Close();
}
}
}
If I'm reading this right, you have something like "{replace me}" in a .docx and then when you loop through the XML, you're finding things like <t>{replace</t><t> me</t><t>}</t> or some such havoc. Now, with XML like that, it's impossible to create a routine that will replace "{replace me}".
If that's the case, then it's very, very likely related to the fact that it's considered a proofing error. i.e. it's misspelled as far as Word is concerned. The cause of it is that you've opened the document in Word and have proofing turned on. As such, the text is marked as "isDirty" and split up into different runs.
The two ways about fixing this are:
Client-side. In Word, just make sure all proofing errors are either corrected or ignored.
Format-side. Use the MarkupSimplifier tool that is part of Open XML Package Editor Power Tool for Visual Studio 2010 to fix this outside of the client. Eric White has a great (and timely for you - just a few days old) write up here on it: Getting Started with Open XML PowerTools Markup Simplifier
If you want to search and replace text in a WordprocessingML document, there is a fairly easy algorithm that you can use:
Break all runs into runs of a single character. This includes runs that have special characters such as a line break, carriage return, or hard tab.
It is then pretty easy to find a set of runs that match the characters in your search string.
Once you have identified a set of runs that match, then you can replace that set of runs with a newly created run (which has the run properties of the run containing the first character that matched the search string).
After replacing the single-character runs with a newly created run, you can then consolidate adjacent runs with identical formatting.
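To make the four steps concrete, here is a rough sketch written against plain LINQ to XML rather than the Open XML SDK. It is not the code from the blog post: RunReplacer and ReplaceFirst are names invented here, it handles only runs that are direct children of one w:p element, it treats a run as either text (a single w:t) or non-text content, and it replaces only the first match in the paragraph.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class RunReplacer
{
    static readonly XNamespace W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Builds a w:t element that keeps leading and trailing spaces.
    static XElement Text(string value)
    {
        return new XElement(W + "t", new XAttribute(XNamespace.Xml + "space", "preserve"), value);
    }

    // Replaces the first occurrence of search in one paragraph; returns false if it was not found.
    public static bool ReplaceFirst(XElement paragraph, string search, string replacement)
    {
        // Step 1: break every text run into single-character runs, copying its run properties.
        foreach (XElement run in paragraph.Elements(W + "r").ToList())
        {
            XElement text = run.Element(W + "t");
            if (text == null) continue; // runs holding breaks, tabs etc. stay as they are
            XElement props = run.Element(W + "rPr");
            run.ReplaceWith(text.Value
                .Select(c => new XElement(W + "r", props, Text(c.ToString())))
                .ToList());
        }

        // Step 2: one character per run, so a string index is also a run index.
        var runs = paragraph.Elements(W + "r").ToList();
        string joined = string.Concat(runs.Select(r =>
        {
            XElement t = r.Element(W + "t");
            return t == null ? "\uFFFC" : t.Value; // placeholder for non-text runs
        }));
        int index = joined.IndexOf(search, StringComparison.Ordinal);
        bool found = search.Length > 0 && index >= 0;

        if (found)
        {
            // Step 3: swap the matched runs for one run that has the first match's properties.
            var matched = runs.Skip(index).Take(search.Length).ToList();
            matched[0].AddBeforeSelf(
                new XElement(W + "r", matched[0].Element(W + "rPr"), Text(replacement)));
            matched.ForEach(r => r.Remove());
        }

        // Step 4: consolidate adjacent text runs whose run properties are identical.
        XElement previous = null;
        foreach (XElement run in paragraph.Elements(W + "r").ToList())
        {
            bool isText = run.Element(W + "t") != null;
            if (isText && previous != null && run.PreviousNode == previous &&
                XNode.DeepEquals(previous.Element(W + "rPr"), run.Element(W + "rPr")))
            {
                previous.Element(W + "t").Value += run.Element(W + "t").Value;
                run.Remove();
            }
            else
            {
                previous = isText ? run : null;
            }
        }
        return found;
    }
}
```

The paragraph element would typically come from the main document part loaded as an XDocument. Matching is ordinal and case-sensitive in this sketch.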
I've written a blog post and recorded a screen-cast that walks through this algorithm.
Blog post: http://openxmldeveloper.org/archive/2011/05/12/148357.aspx
Screen cast: http://www.youtube.com/watch?v=w128hJUu3GM
-Eric