How to determine if XML is well formed? - c#

I've got a large xml document in a string. What's the best way to determine if the xml is well formed?

Something like:
static void Main() {
Test("<abc><def/></abc>");
Test("<abc><def/><abc>");
}
static void Test(string xml) {
using (XmlReader xr = XmlReader.Create(
new StringReader(xml))) {
try {
while (xr.Read()) { }
Console.WriteLine("Pass");
} catch (Exception ex) {
Console.WriteLine("Fail: " + ex.Message);
}
}
}
If you need to check against an xsd, then use XmlReaderSettings.

Simply run it through a parser. That will perform the appropriate checks (whether it parses ok).
If it's a large document (as indicated) then an event-based parser (e.g. SAX) will be appropriate since it won't store the document in memory.
It's often useful to have XML utilities around to check this sort of stuff. I use XMLStarlet, which is a command-line set of tools for XML checking/manipulation.

XmlReader seems a good choice as it should stream the data (not load the whole xml in one go)
http://msdn.microsoft.com/en-us/library/9d83k261.aspx

Try using an XmlReader with an XmlReaderSettings that has ConformanceLevel.Document set.

Related

How to change encoding of read RSS XML in C#

I would like to create RSS for my favourite website, but the problem is that it's RSS XML contains first line which corrupts whole RSS when parsing.
I get this error:
System does not support 'ISO-8859-2' encoding. Line 1, position 31.
Code:
void wc_OpenReadCompleted(object sender, OpenReadCompletedEventArgs e)
{
SyndicationFeed feed;
try {
using (XmlReader reader = XmlReader.Create(e.Result)) {
// I WOULD LIKE to delete some rows from the Result
feed = SyndicationFeed.Load(reader);
lista.ItemsSource = feed.Items;
}
} catch (WebException we) {
MessageBox.Show("The internet connection is down.");
}
}
Apparently the .NET framework used in WP7 doesn't support other encodings than UTF-8 and ISO-8859-1. What you could do, is to generate your own encoding implementation using this tool.
Then you read the stream with a detour via a StreamReader using the custom encoding:
using ( StreamReader sReader = new StreamReader(e.Result, new CustomEncoding()) )
using ( XmlReader xReader = XmlReader.Create(sReader) )
{
//...
}
You could either try to re-encode the string that is in e.Result, perhaps by using the Encoding.Convert method in .NET. But this will probably not be enough since I assume that there is a encoding="ISO-8859-2" attribute in the xml-code. So you will probably also need to do a String.Replace that attribute with something else.
Or just try to replace the attribute with another one and see if that works. Do a e.Result.Replace("ISO-8859-2", "UTF-8") and see what happens. If that doesn't work, try the first option of converting the strings encoding to another one and then to the replace.

Read XML values Webservice

I want to read xml on runtime, without save it on a path
After my searching i find that, In console application i need to use Console.Out for displaying result
xmlSerializer.Serialize(Console.Out, patient);
In Windows / Web Application we need to set path like
StreamWriter streamWriter = new StreamWriter(#"C:\test.xml");
but i need to read xml with out save it, i am using Webserive where i need to read it and take a decision that either it is valid or not
I hope i define it clearly..
Use the XmlDocument object.
There are several ways to load the XML, you can use the XmlDocument.Load() and specify your URL in there or use XmlDocument.LoadXml() to load the XML from a string.
You could use the XmlDocument.LoadXml class to read the received xml. There is no need to save it to disk.
try
{
XmlDocument doc = new XmlDocument();
doc.LoadXml(receivedXMLStr);
//valid xml
}
catch (XmlException xe)
{
//invalid xml
}
Use Linq2Xml..
XElement doc;
try
{
doc=XElement.Load(yourStream);
}
catch
{
//invalid XML
}
foreach(XElement node in doc.Descendants())
{
node.Value;//value of this node
nodes.Attributes();//all the attributes of this node
}
Thanks all of you for your reply, i want to laod my XML without save it on a local Path, because saving creating many XML.
Finally i find the solutions for load the XML from class on a Memory stream, I thinn this solution is very easy and optimize
XmlDocument doc = new XmlDocument();
System.Xml.Serialization.XmlSerializer serializer2 = new System.Xml.Serialization.XmlSerializer(Patients.GetType());
System.IO.MemoryStream stream = new System.IO.MemoryStream();
serializer2.Serialize(stream, Patients);
stream.Position = 0;
doc.Load(stream);
You need to use the Deserialize option to read the xml. Follow the below steps to achieve it,
Create a target class. It structure should represent the xml output.
After creating the class, use the below code to load your xml into the target object
TargetType result = null;
XmlSerializer worker = new XmlSerializer(typeof(TargetType));
result = worker.Deserialize("<xml>.....</xml>");
Now the xml is loaded into the object 'result' and use it.

Convert XML to Plain Text

My goal is to build an engine that takes the latest HL7 3.0 CDA documents and make them backward compatible with HL7 2.5 which is a radically different beast.
The CDA document is an XML file which when paired with its matching XSL file renders a HTML document fit for display to the end user.
In HL7 2.5 I need to get the rendered text, devoid of any markup, and fold it into a text stream (or similar) that I can write out in 80 character lines to populate the HL7 2.5 message.
So far, I'm taking an approach of using XslCompiledTransform to transform my XML document using XSLT and product a resultant HTML document.
My next step is to take that document (or perhaps at a step before this) and render the HTML as text. I have searched for a while, but can't figure out how to accomplish this. I'm hoping its something easy that I'm just overlooking, or just can't find the magical search terms. Can anyone offer some help?
FWIW, I've read the 5 or 10 other questions in SO which embrace or admonish using RegEx for this, and don't think that I want to go down that road. I need the rendered text.
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;
using System.Xml.XPath;
public class TransformXML
{
public static void Main(string[] args)
{
try
{
string sourceDoc = "C:\\CDA_Doc.xml";
string resultDoc = "C:\\Result.html";
string xsltDoc = "C:\\CDA.xsl";
XPathDocument myXPathDocument = new XPathDocument(sourceDoc);
XslCompiledTransform myXslTransform = new XslCompiledTransform();
XmlTextWriter writer = new XmlTextWriter(resultDoc, null);
myXslTransform.Load(xsltDoc);
myXslTransform.Transform(myXPathDocument, null, writer);
writer.Close();
StreamReader stream = new StreamReader (resultDoc);
}
catch (Exception e)
{
Console.WriteLine ("Exception: {0}", e.ToString());
}
}
}
Since you have the XML source, consider writing an XSL that will give you the output you want without the intermediate HTML step. It would be far more reliable than trying to transform the HTML.
This will leave you with just the text:
class Program
{
static void Main(string[] args)
{
var blah = new System.IO.StringReader(sourceDoc);
var reader = System.Xml.XmlReader.Create(blah);
StringBuilder result = new StringBuilder();
while (reader.Read())
{
result.Append( reader.Value);
}
Console.WriteLine(result);
}
static string sourceDoc = "<html><body><p>this is a paragraph</p><p>another paragraph</p></body></html>";
}
Or you can use a regular expression:
public static string StripHtml(String htmlText)
{
// replace all tags with spaces...
htmlText = Regex.Replace(htmlText, #"<(.|\n)*?>", " ");
// .. then eliminate all double spaces
while (htmlText.Contains(" "))
{
htmlText = htmlText.Replace(" ", " ");
}
// clear out non-breaking spaces and & character code
htmlText = htmlText.Replace(" ", " ");
htmlText = htmlText.Replace("&", "&");
return htmlText;
}
Can you use something like this which uses lynx and perl to render the html and then convert that to plain text?
This is a great use-case for XSL:FO and FOP. FOP isn't just for PDF output, one of the other major outputs that is supported is text. You should be able to construct a simple xslt + fo stylesheet that has the specifications (i.e. line width) that you want.
This solution will is a bit more heavy-weight that just using xml->xslt->text as ScottSEA suggested, but if you have any more complex formatting requirements (e.g. indenting), it will become much easier to express in fo, than mocking up in xslt.
I would avoid regexs for extracting the text. That's too low-level and guaranteed to be brittle. If you just want text and 80 character lines, the default xslt template will only print element text. Once you have only the text, you can apply whatever text processing is necessary.
Incidentally, I work for a company who produces CDAs as part of our product (voice recognition for dications). I would look into an XSLT that transforms the 3.0 directly into 2.5. Depending on the fidelity you want to keep between the two versions, the full XSLT route will probably be your easiest bet if what you really want to achieve is conversion between the formats. That's what XSLT was built to do.

An error occurred while parsing EntityName

I'm trying to load a xml document into an object XPathDocument in C#.
My xml documents include this line:
trés dégagée + rade
and when the parser arrives there it gives me this error:
"An error occurred while parsing EntityName"
I know that's normal cause of the character "é". Does anybody know how can I avoid this error... My idea is to insert into the xml document an entities declaration and after replace all special characters with entities...but it's long and I’m not sure if it's working. Do you have other ideas? Simpler?
Thanks a lot
Was about to post this and just then the servers went down. I think I've rewritten it correctly from memory:
I think that the problem lies within the fact that by default the XPathDocument uses an XmlTextReader to parse the contents of the supplied file and this XmlTextReader uses an EntityHandling setting of ExpandEntities.
In other words, when you rely on the default settings, an XmlTextReader will validate the input XML and try to resolve all entities. The better way is to do this manually by taking full control over the XmlReaderSettings (I always do it manually):
string myXMLFile = "SomeFile.xml";
string fileContent = LoadXML(myXMLFile);
private string LoadXML(string xml)
{
XPathDocument xDoc;
XmlReaderSettings xrs = new XmlReaderSettings();
// The following line does the "magic".
xrs.CheckCharacters = false;
using (XmlReader xr = XmlReader.Create(xml, xrs))
{
xDoc = new XPathDocument(xr);
}
if (xDoc != null)
{
XPathNavigator xNav = xDoc.CreateNavigator();
return xNav.OuterXml;
}
else
// Unable to load file
return null;
}
Typically this is caused by a mismatch between the encoding used to read the file and the files actually encoding.
At a guess I would say the file is UTF-8 encoded but you are reading it with a default encoding.
Try beefing up your question with more details to get a more definitive answer.

How to Update a XML Node?

It is easy to read an XML file and get the exact Node Text, but how do I Update that Node with a new value?
To read:
public static String GetSettings(SettingsType type, SectionType section)
{
XmlTextReader reader = new XmlTextReader(HttpContext.Current.Request.MapPath(APPSETTINGSPATH));
XmlDocument document = new XmlDocument();
document.Load(reader);
XmlNode node = document.SelectSingleNode(
String.Format("/MyRootName/MySubNode/{0}/{1}",
Enum.Parse(typeof(SettingsType), type.ToString()),
Enum.Parse(typeof(SectionType), section.ToString())));
return node.InnerText;
}
to write ...?
public static void SetSettings(SettingsType type, SectionType section, String value)
{
try
{
XmlTextReader reader = new XmlTextReader(HttpContext.Current.Request.MapPath(APPSETTINGSPATH));
XmlDocument document = new XmlDocument();
document.Load(reader);
XmlNode node = document.SelectSingleNode(
String.Format("/MyRootName/MySubNode/{0}/{1}",
Enum.Parse(typeof(SettingsType), type.ToString()),
Enum.Parse(typeof(SectionType), section.ToString())));
node.InnerText = value;
node.Update();
}
catch (Exception ex)
{
throw new Exception("Error:", ex);
}
}
Note the line, node.Update(); does not exist, but that's what I wanted :)
I saw the XmlTextWriter object, but it will write the entire XML to a new file, and I just need to update one value in the original Node, I can save as a new file and then rename the new file into the original name but... it has to be simpler to do this right?
Any of you guys have a sample code on about to do this?
Thank you
You don't need an "update" method - setting the InnerText property updates it. However, it only applies the update in memory. You do need to rewrite the whole file though - you can't just update a small part of it (at least, not without a lot of work and no out-of-the-box support).
XmlDocument.Load has an overload that will take the filename directly so there is no need for the reader.
Similarly when you are done XmlDocument.Save will take a filename to which it will save the document.
The nodeValue property can be used to change the value of a text node.
The following code changes the text node value of the first element:
Example:
xmlDoc=loadXMLDoc("books.xml");
x=xmlDoc.getElementsByTagName("title")[0].childNodes[0];
x.nodeValue="Easy Cooking";
source: http://www.w3schools.com/DOM/dom_nodes_set.asp
You're updating the node in an in-memory representation of the xml document, AFAIK there's no way to update the node directly in the physical file. You have to dump it all back to a file.

Categories