Remove comments from an XML file with double dashes -- - c#

How can I remove invalid xml comments that contain double dashes(--) from an xml file?
I'm trying to load the xml file, but it is failing. These comments make the xml invalid. The xml comes from a vendor.
I tried removing these based on approaches from other posts, but I was not successful. Here is an example of the xml:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!--MAIN VARIABLES-->
<content type="screwed">
<!--KEEP 19-39 -- SEE HELP.TXT AND THE VIDEO TUTORIALS FOR MORE INFO -->
<!--REGULAR/NON-Regular EXAMPLE --><SomeTag somefile="test.txt3" Name="test"/>
<!-- -->
</content>
I have tried the following without success:
string xmlDocFile = "c:\server\test.xml";
XmlReaderSettings readerSettings = new XmlReaderSettings();
readerSettings.IgnoreComments = true;
readerSettings.ProhibitDtd = false;
readerSettings.ValidationType = ValidationType.DTD;
XmlReader reader = XmlReader.Create(xmlDocFile, readerSettings);
XmlDocument myXmlDoc = new XmlDocument();
myXmlDoc.Load(reader);
myXmlDoc.Save(xmlDocFile);

Before using XmlReader, parse xml file and filter comments using regexp.
// using System.Text.RegularExpressions;
System.IO.StreamReader file= new System.IO.StreamReader(xmlDocFile);
string validXml = Regex.Replace(file.ReadToEnd(),"<!--.*?-->","");
XmlReader reader = XmlReader.Create(validXml);

Related

XElement.Load() and "undeclared prefix" exception

I'm trying to load xml file using Xelement.Load() method and in case of some files, I get "ditaarch" is an undeclared prefix exception. The content of such troublesome xml's are similar to this simplified version:
<?xml version="1.0" encoding="UTF-8"?>
<concept ditaarch:DITAArchVersion="1.3">
<title>Test Title</title>
<menucascade>
<uicontrol>text</uicontrol>
<uicontrol/>
</menucascade>
</concept>
I've tried to follow suggestions to manually add or ignore "ditaarch" namespace using xml namespace manager:
using (XmlReader reader = XmlReader.Create(#"C:\test\example.xml"))
{
NameTable nameTable = new NameTable();
XmlNamespaceManager nameSpaceManager = new XmlNamespaceManager(nameTable);
nameSpaceManager.AddNamespace("ditaarch", "");
XmlParserContext parserContext = new XmlParserContext(null, nameSpaceManager, null, XmlSpace.None);
XElement elem = XElement.Load(reader);
}
But it leads to same exception as before. Most probably the solution is trivial but I just can't see it :(
If anyone would be able to point me in the right direction, I would be most grateful.
The presented markup is not namespace well-formed XML so I don't think XElement or XDocument is an option as it doesn't support colons in names. You can parse it with a legacy new XmlTextReader("foo.xml") { Namespaces = false } however.
And you could use XmlDocument instead of XDocument or XElement and check for any empty elements with e.g.
XmlDocument doc = new XmlDocument();
using (XmlReader xr = new XmlTextReader("example.xml") { Namespaces = false })
{
doc.Load(xr);
}
Console.WriteLine("Number of empty elements: {0}", doc.SelectNodes("//*[not(*)][not(normalize-space())]").Count);

Parsing wrongly formatted XML-SOAP with C#

I have malformed XML (SOAP) file which I need to parse. The issue is that XML doesn't have proper header tags.
I've tried to parse file with XDocument and XmlDocument but neither has worked. XML starts from the line 30, so maybe there is some way to skip those lines before file is read by XML parser?
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:eb="http://www.oasis-open.org/committees/ebxml-msg/schema/msg-header-2_0.xsd">
<SOAP-ENV:Header>
</SOAP-ENV:Header>
<SOAP-ENV:Body>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>
<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type="text/xsl" href="Finvoice.xsl"?>
<GGVersion="2.01" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="a.xsd">
XmlReaderSettings settings = new XmlReaderSettings();
settings.ConformanceLevel = ConformanceLevel.Fragment;
XmlReader r = XmlReader.Create(file.FullName, settings);
XmlDocument xDoc = new XmlDocument();
xDoc.PreserveWhitespace = true;
xDoc.LoadXml("<xml/>");
xDoc.DocumentElement.CreateNavigator().AppendChild(r);
XmlNamespaceManager manager = new XmlNamespaceManager(xDoc.NameTable);
Once trying to parse I get: Unexpected xml declaration. The xml declaration must be the first node in the document ....
If I understand you correctly, then the data you are looking for starts after the SOAP envelope. There is no garbage/unnessescary contents after the data you are looking for.
The SOAP header does not start with the XML declaration (<?xml version=, etc).
Looking for the start of the document
A simple solution is to find the start of the XML document (the data you are looking for), and chop away everything before that.
var startOfRealDocumentMarker = "<?xml version=\"1.0\"";
var startIndex = dirtyXmlString.IndexOf(startOfRealDocumentMarker);
if(startIndex == -1) {
throw new Exception("Start of XML not found. Now what?");
}
var cleanXmlString = dirtyXmlString.Substring(startIndex);
If the SOAP header also has an XML declaration, you could look for the end-tag of the SOAP envelope instead. Or you could start looking for the declaration at the 2nd character, so you would skip over the first one.
This is obviously not a fool-proof solution that will work in every case. But maybe it will work in all of your cases?
Skipping lines
If you're sure it will work to always start reading from line 30 of the input file, you can use this method instead.
XmlDocument xDoc = new XmlDocument();
using (var rdr = new StreamReader(pathToXmlFile))
{
// Skip until reader is positioned at start of line 30
for (var i = 0; i < 29; ++i)
{
rdr.ReadLine();
}
// Load document from current position of reader
xDoc.Load(rdr);
}

Alternatives to XDocument and XmlDocument for loading xml files in C#?

I want to change an attribute inside an xml file using C#.
Here is a sample XML file
<?xml version="1.0" encoding="us-ascii"?>
<Client>
<Age>25</Age>
<Weight>50</Weight>
</Client>
I tried loading the xml file using both XmlDocument and XDocument. They both take so much time (more than 5 minutes) to load.
Here is the code I am using to load the file:
string filePath = #"myFile.xml";
XmlDocument xmlData = new XmlDocument();
As per Google, the problem is that XDocument and XmlDocument will load all the DTDs for XML file, and this is why it takes much time. Is there a workaround for this? or maybe any alternative that allows me to change an attribute without loading all the DtDs?
You can control how DTDs are cached, parsed or used for validation with XmlReaderSettings and still use XDocument.
If you can take the time to cache the DTDs and changing them isn't part of your test, you could take the hit once and cache them.
If that's too much time or they aren't available and they aren't needed for your tests, you could skip DTD processing.
using (var reader = XmlReader.Create(_,
new XmlReaderSettings
{
DtdProcessing = DtdProcessing.Ignore,
ValidationType = ValidationType.None,
//DtdProcessing = DtdProcessing.Parse,
//ValidationType = ValidationType.DTD,
XmlResolver = new XmlUrlResolver
{
CachePolicy = new RequestCachePolicy(RequestCacheLevel.CacheIfAvailable),
//CachePolicy = new RequestCachePolicy(RequestCacheLevel.NoCacheNoStore),
}
}))
{
var doc = XDocument.Load(reader);
//…
}
XmlReaderSettings has many other properties that sometimes come in handy.

XmlReader : data invalid exception

I'm trying to read an XML file and I get an XmlException : "data at the root level is invalid. Line 1, position 1".
Here is the content of the XML file :
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<root>
<Materials override="TRUE">
<Material name="" diffuse="" />
</Materials>
</root>
And here is my code :
using (FileStream fstr = File.OpenRead(sFullPath))
{
XmlReaderSettings settings = new XmlReaderSettings();
settings.ConformanceLevel = ConformanceLevel.Document;
fstr.Position = 0;
using (XmlReader xmlReader = XmlReader.Create(fstr, settings))
{
while (xmlReader.Read())
{
}
}
}
The exception is raised by the call to Read().
I've been searching for an answer on different sites, had a look at the MSDN too, but can't solve my problem.
My code is taken from http://www.codeproject.com/Articles/318876/Using-the-XmlReader-class-with-Csharp but I tried different snippets too.
I also checked the encoding of my file on Notepad++, tried both UTF-8 and UTF-8 without BOM, didn't make a change.
I'm stuck on this for a couple of days and I'm running out of ideas.
Thanx for your help!
Edit : removed the "..." in the snippet to avoid confusing people. I also did a try with :
using (XmlTextReader xmlReader = new XmlTextReader(fstr))
and it appears that xmlReader.Encoding is returning null, whereas my file is encoded to UTF-8.

Validate xml against dtd from string

I have an xml file that refers to a local dtd file. But the problem is that my files are being compressed into a single file (I am using Unity3D and it puts all my textfiles into one binary). This question is not Unity3D specific, it is useful for anyone that tries to load a DTD schema from a string.
I have thought of a workaround to load the xml and load the dtd separately and then add the dtd file to the XmlSchemas of my document. Like so:
private void ReadConfig(string filePath)
{
// load the xml file
TextAsset text = (TextAsset)Resources.Load(filePath);
StringReader sr = new StringReader(text.text);
sr.Read(); // skip BOM, Unity3D catch!
// load the dtd file
TextAsset dtdAsset = (TextAsset)Resources.Load("Configs/condigDtd");
XmlSchemaSet schemaSet = new XmlSchemaSet();
schemaSet.Add(...); // my dtd should be added into this schemaset somehow, but it's only a string and not a filepath.
XmlReaderSettings settings = new XmlReaderSettings() { ValidationType = ValidationType.DTD, ProhibitDtd = false, Schemas = schemaSet};
XmlReader r = XmlReader.Create(sr, settings);
XmlDocument doc = new XmlDocument();
doc.Load(r);
}
The xml starts like this, but the dtd cannot be found. Not strange, because the xml file was loaded as a string, not from a file.
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE Scene SYSTEM "configDtd.dtd">
XmlSchema has a Read method that takes in a Stream and a ValidationEventHandler. If the DTD is a string, you could convert it to a stream
System.Text.Encoding encode = System.Tet.Encoding.UTF8;
MemoryStream ms = new MemoryStream(encode.GetBytes(myDTD));
create the XmlSchema
XmlSchema mySchema = XmlSchema.Read(ms, DTDValidation);
add this schema to the XmlDocument containing the xml you are validating
myXMLDocument.Schemas.Add(mySchema);
myXMLDocument.Schemas.Compile();
myXMLDocument.Validate(DTDValidation);
The DTDValidation() handler would contain code handling what to do if the xml is invalid.

Categories