I have a huge chunk of XML data that I need to "clean". The Xml looks something like this:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
<w:body>
<w:p>
<w:t>F_ck</w:t>
<!-- -->
<w:t>F_ck</w:t>
<!-- -->
<w:t>F_ck</w:t>
</w:p>
</w:body>
</w:document>
I would like to identify the <w:t>-elements with the value "F_ck" and replace the value with something else. The elements I need to clean will be scattered throughout the document.
I need the code to run as fast as possible and with a memory footprint as small as possible, so I am reluctant to use the XDocument (DOM) approaches I have found here and elsewhere.
The data is given to me as a stream containing the Xml data, and my gut feeling tells me that I need the XmlTextReader and the XmlTextWriter.
My original idea was to do a SAX-mode, forward-only run through the Xml data and "pipe" it over to the XmlTextWriter, but I cannot find an intelligent way to do so.
I wrote this code:
var reader = new StringReader(content);
var xmltextReader = new XmlTextReader(reader);
var memStream = new MemoryStream();
var xmlWriter = new XmlTextWriter(memStream, Encoding.UTF8);
while (xmltextReader.Read())
{
if (xmltextReader.Name == "w:t")
{
//xmlWriter.WriteRaw("blah");
}
else
{
xmlWriter.WriteRaw(xmltextReader.Value);
}
}
The code above only writes the Value of each node (text content, the XML declaration, and so on), so no angle brackets or element names appear in the output. I realize that I could write code that specifically calls .WriteStartElement(), .WriteEndElement(), etc. depending on the NodeType, but I fear that will quickly become a mess.
So the question is:
How do I, in a clean way, pipe the XML data read from the XmlTextReader to the XmlTextWriter while still being able to manipulate the data as it passes through?
Try this
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
string xml =
"<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"yes\"?>" +
"<w:document xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\">" +
"<w:body>" +
"<w:p>" +
"<w:t>F_ck</w:t>" +
"<!-- -->" +
"<w:t>F_ck</w:t>" +
"<!-- -->" +
"<w:t>F_ck</w:t>" +
"</w:p>" +
"</w:body>" +
"</w:document>";
XDocument doc = XDocument.Parse(xml);
XElement document = (XElement)doc.FirstNode;
XNamespace ns_w = document.GetNamespaceOfPrefix("w");
List<XElement> ts = doc.Descendants(ns_w + "t").ToList();
foreach (XElement t in ts)
{
t.Value = "abc";
}
}
}
}
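If the DOM has to be avoided, the reader-to-writer pipe that the question asks about could be sketched roughly as below. This is only a sketch under assumptions: the class name XmlPipe, the method Clean and its parameters are invented for illustration, each w:t element is assumed to hold a single text node, and DOCTYPE and entity-reference nodes are not handled. It copies every node from an XmlReader to an XmlWriter and swaps the text only when it sits inside a w:t element and equals the unwanted value.

```csharp
using System.IO;
using System.Text;
using System.Xml;

public static class XmlPipe
{
    // Namespace bound to the "w" prefix in the question's sample.
    const string W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical helper: streams XML from input to output, replacing the
    // text of every w:t element whose content equals `bad`.
    public static void Clean(Stream input, Stream output, string bad, string replacement)
    {
        var settings = new XmlWriterSettings { Encoding = new UTF8Encoding(false) };
        using (XmlReader reader = XmlReader.Create(input))
        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            bool insideT = false; // true while positioned inside a non-empty w:t
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                    {
                        bool isEmpty = reader.IsEmptyElement;
                        insideT = !isEmpty && reader.LocalName == "t" && reader.NamespaceURI == W;
                        writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
                        writer.WriteAttributes(reader, true); // also copies xmlns declarations
                        if (isEmpty) writer.WriteEndElement();
                        break;
                    }
                    case XmlNodeType.Text:
                        writer.WriteString(insideT && reader.Value == bad ? replacement : reader.Value);
                        break;
                    case XmlNodeType.EndElement:
                        insideT = false;
                        writer.WriteFullEndElement();
                        break;
                    case XmlNodeType.CDATA:
                        writer.WriteCData(reader.Value);
                        break;
                    case XmlNodeType.Comment:
                        writer.WriteComment(reader.Value);
                        break;
                    case XmlNodeType.Whitespace:
                    case XmlNodeType.SignificantWhitespace:
                        writer.WriteWhitespace(reader.Value);
                        break;
                    case XmlNodeType.XmlDeclaration:
                    case XmlNodeType.ProcessingInstruction:
                        writer.WriteProcessingInstruction(reader.Name, reader.Value);
                        break;
                }
            }
        }
    }
}
```

A call such as XmlPipe.Clean(inputStream, outputStream, "F_ck", "****") would then process the document without loading it all into memory; the stream names here are placeholders.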
I have a large single-node .xml file that I have saved as a string. I want to parse it for a specific element and output that element's inner text. E.g.: I want to read the FrameNo element and output BINGO to a message box. The desired element will only appear once in the .xml document. I prefer using XmlDocument.
I have tried numerous C# .xml examples but am unable to get any output.
The XML text is:
<Aircraft z:Id="i1" xmlns="http://xxx.yyyyycontract.gov/2018/03/Boeing.xxxxxxxxxxxxxx.Airframe"
xmlns:i="http://www.xxxxxxx.com/2019/XMLSchema-instance"
xmlns:z="http://xxxxxxx.xxxxxxxxx.com/2005/01/Serialization/"><Timestamp i:nil="true"/>
<Uuid>00000000-0000-0000-0000-000000000000</Uuid><Comments i:nil="true"/><Facility>..........
and so on to the end of the .xml:
<FrameNo>BINGO</FrameNo><WDate i:nil="true"/></Aircraft>
This is the method I want the code to run in:
private void buttonLoad_Click(object sender, EventArgs e)
{
}
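I think this is self-explanatory:
using System.Xml.Linq;
// textXML holds the XML text itself, so parse it rather than loading it as a path
XElement root = XElement.Parse(textXML);
// FrameNo lives in the document's default namespace
XElement myElement = root.Element(root.GetDefaultNamespace() + "FrameNo");
if (myElement != null)
myData = myElement.Value;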
Thanks to jdweng. I wanted to share the final code for others to use. It works in a method like the one below:
private void buttonMaint_Click(object sender, EventArgs e)
{
XDocument doc = XDocument.Parse(xmlinputstr); // input string from memory or input file
XNamespace ns = doc.Root.GetDefaultNamespace();
string[] Frame = doc.Descendants(ns + "FrameNo").Select(x => (string)x).ToArray(); // text of every FrameNo element in the default namespace
string frame = string.Join("", Frame); // converts from array to string
if (string.IsNullOrEmpty(frame)) // check for empty result
{
txtFrame.Text = "not found"; //outputs to textbox
}
else
{
txtFrame.Text = (frame); //outputs to textbox
}
}
The comments are there for clarity.
You need to use the default namespace. See my LINQ to XML solution below:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
using System.IO;
namespace ConsoleApplication1
{
class Program
{
const string FILENAME = @"c:\temp\test.xml";
static void Main(string[] args)
{
string xml = File.ReadAllText(FILENAME);
XDocument doc = XDocument.Parse(xml);
XNamespace ns = doc.Root.GetDefaultNamespace();
XElement frameNo = doc.Descendants(ns + "FrameNo").FirstOrDefault();
string frame = (string)frameNo;
string[] serialNumbers = doc.Descendants(ns + "SerialNumber").Select(x => (string)x).ToArray();
}
}
}
Another weird snag has shown up. Some of the elements are named like this:
<a:SupplierServDoc>
The inner text of this element is a base64 packet. There is no problem processing the base64 packet.
The code from the above answers does output the base64 correctly but cannot handle the : in the element name. It throws an error saying that the ':' character (hexadecimal value 0x3A) cannot be included in a name.
I have this code that outputs the inner text, but not as a base64 packet. I have also looked into using the prefix to handle the :, but with worse results. I am outputting the base64 inner text as a .txt file when finished.
XNamespace ad = "http://www.mmmmmmmmmm.com";
XName k = ad + "SupplierServDoc";
string[] WING = doc.Descendants(k).Select(x => (string)x).ToArray();
string wing = string.Join("", WING);
if (string.IsNullOrEmpty(wing))
{
MessageBox.Show("a:SupplierServDoc Base 64 code not found");
}
else
{
MessageBox.Show("Test " + wing);
}
I have an XML document which basically looks like this:
<ArrayOfAspect xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<Aspect i:type="TransactionAspect">
...
</Aspect>
<Aspect i:type="TransactionAspect">
...
</Aspect>
</ArrayOfAspect>
And I want to append a new Aspect to this list.
To do so, I load this XML from a file, create an XmlDocumentFragment, and load the new Aspect from a file (which is basically a template I fill with data). Then I fill the document fragment with the new aspect and append it as a child.
But when I try to set the XML of this fragment, it fails because the prefix i is not defined.
// Load all aspects
var aspectsXml = new XmlDocument();
aspectsXml.Load("aspects.xml");
// Create and fill the fragment
var fragment = aspectsXml.CreateDocumentFragment();
fragment.InnerXml = _templateIFilledWithData; // This fails because i is not defined
// Add the new child
aspectsXml.DocumentElement.AppendChild(fragment);
This is what the template looks like:
<Aspect i:type="TransactionAspect">
<Value>$VALUES_PLACEHOLDER$</Value>
...
</Aspect>
Note that I don't want to create POCOs for this and serialize them, since the aspects are actually quite big and nested, and I have the same problem with some other XML files as well.
EDIT:
jdweng proposed using LINQ to XML (which is way better than what I used before, so thanks). Here is the code I tried with LINQ to XML (still failing because of the undeclared prefix):
var aspects = XDocument.Load("aspects.xml");
var newAspect = XElement.Parse(_templateIFilledWithData); // Fails here - Undeclared prefix 'i'
aspects.Root.Add(newAspect);
Use xml linq :
using System.Collections.ObjectModel;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
namespace ConsoleApplication57
{
class Program
{
const string URL = "http://goalserve.com/samples/soccer_inplay.xml";
static void Main(string[] args)
{
string xml =
"<ArrayOfAspect xmlns:i=\"http://www.w3.org/2001/XMLSchema-instance\">" +
"<Aspect i:type=\"TransactionAspect\">" +
"</Aspect>" +
"<Aspect i:type=\"TransactionAspect\">" +
"</Aspect>" +
"</ArrayOfAspect>";
XDocument doc = XDocument.Parse(xml);
XElement root = doc.Root;
XNamespace nsI = root.GetNamespaceOfPrefix("i");
root.Add(new XElement("Aspect", new object[] {
new XAttribute(nsI + "type", "TransactionAspect"),
new XElement("Value", "$VALUES_PLACEHOLDER$")
}));
}
}
}
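If the template has to stay as text, another option is to hand the parser the prefixes that the target document already declares. The following is a minimal sketch, assuming the template contains exactly one root element; the class FragmentParser and the method ParseInScope are hypothetical names chosen for illustration. It copies the requested prefix bindings from an existing element into an XmlNamespaceManager and passes them to the reader through an XmlParserContext, so that XElement.Load no longer reports an undeclared prefix.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class FragmentParser
{
    // Hypothetical helper: parses `fragment`, resolving the given prefixes
    // against the namespace declarations that are in scope on `scope`.
    public static XElement ParseInScope(string fragment, XElement scope, params string[] prefixes)
    {
        var namespaces = new XmlNamespaceManager(new NameTable());
        foreach (string prefix in prefixes)
        {
            XNamespace uri = scope.GetNamespaceOfPrefix(prefix);
            if (uri != null)
            {
                namespaces.AddNamespace(prefix, uri.NamespaceName);
            }
        }

        // The parser context makes the bindings visible to the reader even
        // though the fragment text itself never declares them.
        var context = new XmlParserContext(null, namespaces, null, XmlSpace.None);
        using (XmlReader reader = XmlReader.Create(new StringReader(fragment), new XmlReaderSettings(), context))
        {
            return XElement.Load(reader);
        }
    }
}
```

With the question's names, usage might look like aspects.Root.Add(FragmentParser.ParseInScope(_templateIFilledWithData, aspects.Root, "i")).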
I generated an XML file through an API call and then tried to read it using the XML Source component in SSIS, but it reads only some data sets instead of all the data in the file.
Here is my file:
<?xml version="1.0"?>
<ABC>
<a>info</a>
</ABC>
But I want the file to look like the one below; only then can I easily read it using the component.
We can edit a single file manually, but not thousands of files.
<?xml version="1.0"?>
<X>
<ABC>
<a>info</a>
</ABC>
</X>
How do I add that 'X' node to the existing file?
I do not have much exposure to .NET technology.
Kindly help me as soon as possible.
Thank You
KiranKumar
Using LINQ to XML:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
string xml =
"<?xml version=\"1.0\" encoding=\"utf-8\" ?>" +
"<ABC>" +
"<a>info</a>" +
"</ABC>";
XDocument doc = XDocument.Parse(xml);
XElement root = doc.Root;
root.ReplaceWith(new XElement("X", root));
}
}
}
Try the streaming API.
using (var reader = XmlReader.Create("test.xml"))
using (var writer = XmlWriter.Create("test2.xml"))
{
writer.WriteStartElement("X");
reader.MoveToContent();
writer.WriteNode(reader.ReadSubtree(), true);
writer.WriteEndElement();
}
This approach handles the XML without excessive memory consumption.
It also lets you modify the XML on the fly, reading it from the input API stream and writing it to an output stream:
using (var reader = XmlReader.Create(inputStream))
using (var writer = XmlWriter.Create(outputStream))
I have the following problem: I created a downloader that downloads XML documents, but one of the documents is broken: it is missing its end tags. For example:
<?xml version="1.0"?>
<rows xmlns:fo="http://www.w3.org/1999/XSL/Format">
<row StateID="AK">
I have the following code:
public void SaveFiles(SftpClient sftp, string DirectoryName, string PathToFile)
{
foreach (Renci.SshNet.Sftp.SftpFile ftpfile in sftp.ListDirectory("." + DirectoryName))
{
DateTime downloadTime = ftpfile.LastWriteTime;
string newFileName = ftpfile.Name;
bool checkFile = check(PathToFile, newFileName, downloadTime);
if (checkFile == true)
{
FileStream fs = new FileStream(PathToFile + "\\" + ftpfile.Name, FileMode.Create);
sftp.DownloadFile(ftpfile.FullName, fs);
fs.Close();
File.SetLastWriteTime(PathToFile + "\\" + ftpfile.Name, downloadTime);
}
else
{
continue;
}
}
}
A document containing an unclosed tag is not XML at all. As others suggested in the comments, ideally this problem is fixed by the party that generates the document.
Regarding the original question, detecting unclosed tags in general isn't a trivial task. I would suggest trying HtmlAgilityPack (HAP). It has built-in functionality to close unclosed tags automatically (the closing tag is added immediately after the opening tag).
Example using HAP:
using HtmlAgilityPack;
......
var xml = @"<?xml version=""1.0""?>
<rows xmlns:fo=""http://www.w3.org/1999/XSL/Format"">
<row StateID=""AK"">";
var doc = new HtmlDocument();
doc.LoadHtml(xml);
Console.WriteLine(doc.DocumentNode.OuterHtml);
Output:
<?xml version="1.0"?>
<rows xmlns:fo="http://www.w3.org/1999/XSL/Format">
<row stateid="AK"></row></rows>
I'm trying to edit an XML file while preserving its formatting:
<root>
<files>
<file>a</file>
<file>b</file>
<file>c</file>
<file>d</file>
</files>
</root>
So I load the XML document using
XDocument xDoc = XDocument.Load(path, LoadOptions.PreserveWhitespace);
But when I try to add new elements with
xDoc.Root.Element("files").Add(new XElement("test","test"));
xDoc.Root.Element("files").Add(new XElement("test2","test2"));
they are added on the same line, so the output looks like:
<root>
<files>
<file>a</file>
<file>b</file>
<file>c</file>
<file>d</file>
<test>test</test><test2>test2</test2></files>
</root>
So how can I add each new element on its own line while preserving the initial formatting? I tried using an XmlWriter with Settings.Indent = true to save the XDocument, but as far as I can see, the elements are already placed on the same line when I call xDoc.Root.Element().Add().
Update: the full part of the program that loads, modifies and saves the document:
using System;
using System.Linq;
using System.Xml;
using System.Xml.Linq;
using System.Text;
namespace ConsoleApplication2
{
class Program
{
static void Main(string[] args)
{
string path = #".\doc.xml";
XDocument xDoc = XDocument.Load(path, LoadOptions.PreserveWhitespace);
// When I debug, I see in the Watch window that after these calls the new elements are already on the same line
xDoc.Descendants("files").First().Add(new XElement("test", "test"));
xDoc.Descendants("files").First().Add(new XElement("test2", "test2"));
XmlWriterSettings settings = new XmlWriterSettings();
settings.Encoding = Encoding.UTF8;
settings.Indent = true;
settings.IndentChars = "\t";
using (XmlWriter writer = XmlWriter.Create(path, settings))
{
xDoc.Save(writer);
// Here I also tried saving without the writer: xDoc.Save(path)
}
}
}
}
The problem appears to be caused by your use of LoadOptions.PreserveWhitespace. This seems to trump XmlWriterSettings.Indent - you've basically said, "I care about this whitespace"... "Oh, now I don't."
If you remove that option, just using:
XDocument xDoc = XDocument.Load(path);
... then it indents appropriately. If you want to preserve all the original whitespace but then indent just the new elements, I think you'll need to add that indentation yourself.
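Adding the indentation by hand could look something like the sketch below. It assumes the parent already has at least one child element that is preceded by a whitespace text node, which is the case when the document was loaded with LoadOptions.PreserveWhitespace and was indented to begin with; the class IndentedAdd and the method AddOnNewLine are hypothetical names. The idea is to copy the whitespace in front of the last existing child and insert it before the new element.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class IndentedAdd
{
    // Hypothetical helper: appends `child` after the last element of `parent`,
    // reusing the whitespace that precedes that last element as indentation.
    public static void AddOnNewLine(XElement parent, XElement child)
    {
        XElement last = parent.Elements().LastOrDefault();
        XText indent = last?.PreviousNode as XText;
        if (last == null || indent == null)
        {
            // Nothing to copy the indentation from; fall back to a plain append.
            parent.Add(child);
            return;
        }

        // A fresh XText is needed because a node can only have one parent.
        last.AddAfterSelf(new XText(indent.Value), child);
    }
}
```

Calling IndentedAdd.AddOnNewLine(xDoc.Root.Element("files"), new XElement("test", "test")) once per new element should place each one on its own line, provided the document is saved without re-indenting.\n    <test2>test2</test2>\n  </files>")) throw new System.Exception("indentation not preserved: " + result);
var empty = new System.Xml.Linq.XElement("empty");
IndentedAdd.AddOnNewLine(empty, new System.Xml.Linq.XElement("x"));
if (empty.Elements().Count() != 1) throw new System.Exception("fallback append failed");
</test>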
I had a similar problem and was able to solve it with the code below:
var newPolygon = new XElement(doc.Root.GetDefaultNamespace() + "polygon");
groupElement.Add(newPolygon);
groupElement.Add(Environment.NewLine);
I hope this code helps someone.