How to change encoding of read RSS XML in C#

I would like to read the RSS feed of my favourite website, but the problem is that its RSS XML starts with a line that breaks parsing of the whole feed.
I get this error:
System does not support 'ISO-8859-2' encoding. Line 1, position 31.
Code:
void wc_OpenReadCompleted(object sender, OpenReadCompletedEventArgs e)
{
SyndicationFeed feed;
try {
using (XmlReader reader = XmlReader.Create(e.Result)) {
// I WOULD LIKE to delete some rows from the Result
feed = SyndicationFeed.Load(reader);
lista.ItemsSource = feed.Items;
}
} catch (WebException we) {
MessageBox.Show("The internet connection is down.");
}
}

Apparently the .NET framework used in WP7 doesn't support other encodings than UTF-8 and ISO-8859-1. What you could do, is to generate your own encoding implementation using this tool.
Then you read the stream with a detour via a StreamReader using the custom encoding:
using ( StreamReader sReader = new StreamReader(e.Result, new CustomEncoding()) )
using ( XmlReader xReader = XmlReader.Create(sReader) )
{
//...
}
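A minimal sketch of the same detour on a runtime where the encoding is available. This is an assumption that does not hold on WP7: it targets .NET Core 3.0 or later, where `CodePagesEncodingProvider` supplies ISO-8859-2; on WP7 the custom `Encoding` class from the answer would take its place. The class and method names (`Latin2Feed`, `Load`) are hypothetical. Because the `XmlReader` is created over a `TextReader`, the `StreamReader` does the decoding and the reader should not need to switch encodings based on the declaration.

```csharp
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

public static class Latin2Feed
{
    // Hypothetical helper: decodes the stream as ISO-8859-2 before XML parsing.
    public static XDocument Load(Stream input)
    {
        // Assumption: .NET Core 3.0+ / .NET 5+, where code-page encodings
        // have to be registered before GetEncoding can find them.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding latin2 = Encoding.GetEncoding("iso-8859-2");

        using (var sReader = new StreamReader(input, latin2))
        using (var xReader = XmlReader.Create(sReader))
        {
            return XDocument.Load(xReader);
        }
    }
}
```

In the handler from the question this would be called as `Latin2Feed.Load(e.Result)`; `SyndicationFeed.Load(xReader)` could replace `XDocument.Load(xReader)` in the same place.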

You could try to re-encode the string that is in e.Result, perhaps by using the Encoding.Convert method in .NET. But this will probably not be enough, since I assume there is an encoding="ISO-8859-2" attribute in the XML. So you will probably also need a String.Replace to swap that attribute for something else.
Or just try replacing the attribute on its own and see if that works: do an e.Result.Replace("ISO-8859-2", "UTF-8") and see what happens. If that doesn't work, try the first option: convert the string's encoding to another one and then do the replace.

Related

xml.LoadData - Data at the root level is invalid. Line 1, position 1

I'm trying to parse some XML inside a WiX installer. The XML would be an object of all my errors returned from a web server. I'm getting the error in the question title with this code:
XmlDocument xml = new XmlDocument();
try
{
xml.LoadXml(myString);
}
catch (Exception ex)
{
System.IO.File.WriteAllText(@"C:\text.txt", myString + "\r\n\r\n" + ex.Message);
throw; // rethrow without resetting the stack trace
}
myString is this (as seen in the output of text.txt)
<?xml version="1.0" encoding="utf-8"?>
<Errors></Errors>
text.txt comes out looking like this:
<?xml version="1.0" encoding="utf-8"?>
<Errors></Errors>
Data at the root level is invalid. Line 1, position 1.
I need this XML to parse so I can see if I had any errors.

The hidden character is probably BOM.
The explanation to the problem and the solution can be found here, credits to James Schubert, based on an answer by James Brankin found here.
Though the previous answer does remove the hidden character, it also removes the whole first line. The more precise version would be:
string _byteOrderMarkUtf8 = Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());
if (xml.StartsWith(_byteOrderMarkUtf8))
{
xml = xml.Remove(0, _byteOrderMarkUtf8.Length);
}
I encountered this problem when fetching an XSLT file from Azure blob and loading it into an XslCompiledTransform object.
On my machine the file looked just fine, but after uploading it as a blob and fetching it back, the BOM character was added.

Use Load() method instead, it will solve the problem. See more

The issue here was that myString had that header line. Either there was some hidden character at the beginning of the first line or the line itself was causing the error. I sliced off the first line like so:
xml.LoadXml(myString.Substring(myString.IndexOf(Environment.NewLine)));
This solved my problem.

I think the problem is about encoding. That's why removing the first line (with the encoding byte) might solve the problem.
My solution for "Data at the root level is invalid. Line 1, position 1." in XDocument.Parse(xmlString) was to replace it with XDocument.Load(new MemoryStream(xmlContentInBytes));
I've noticed that my xml string looked ok:
<?xml version="1.0" encoding="utf-8"?>
but in different text editor encoding it looked like this:
?<?xml version="1.0" encoding="utf-8"?>
In the end I did not need the XML string but the XML byte[]. If you need to use the string, you should look for "invisible" bytes in your string and adjust the encoding so the XML content can be parsed or loaded.
Hope it will help
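A small sketch of the difference described above, using only in-memory data. The class and method names are hypothetical. It builds UTF-8 bytes that start with a BOM, shows that decoding them to a string and calling `XDocument.Parse` fails, and that `XDocument.Load` over a `MemoryStream` of the same bytes succeeds.

```csharp
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

public static class BomDemo
{
    // Builds UTF-8 bytes that start with a byte order mark, as a file or HTTP body might.
    public static byte[] WithBom(string xml)
    {
        byte[] preamble = Encoding.UTF8.GetPreamble();
        byte[] body = Encoding.UTF8.GetBytes(xml);
        byte[] result = new byte[preamble.Length + body.Length];
        preamble.CopyTo(result, 0);
        body.CopyTo(result, preamble.Length);
        return result;
    }

    // Decoding to a string first keeps U+FEFF as the first character, which Parse rejects.
    public static bool ParseFromStringFails(byte[] bytes)
    {
        try
        {
            XDocument.Parse(Encoding.UTF8.GetString(bytes));
            return false;
        }
        catch (XmlException)
        {
            return true;
        }
    }

    // Loading from the raw bytes lets the XML reader consume the BOM itself.
    public static XDocument LoadFromBytes(byte[] bytes)
    {
        using (var stream = new MemoryStream(bytes))
        {
            return XDocument.Load(stream);
        }
    }
}
```

Passing the result of `BomDemo.WithBom(...)` to both methods shows the two behaviours side by side on identical input.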

Save your file with different encoding:
File > Save file as... > Save as UTF-8 without signature.
In VS 2017 you find encoding as a dropdown next to Save button.

The main culprit for this error is the logic that determines the encoding when converting a Stream or byte[] array to a .NET string.
A StreamReader created with the second constructor parameter, detectEncodingFromByteOrderMarks, set to true will determine the proper encoding and create a string that does not break the XmlDocument.LoadXml method.
public string GetXmlString(string url)
{
using var stream = GetResponseStream(url);
using var reader = new StreamReader(stream, true);
return reader.ReadToEnd(); // no exception on `LoadXml`
}
A common mistake is to blindly use UTF8 encoding on the stream or byte[]. The code below produces a string that looks valid when inspected in the Visual Studio debugger or copy-pasted somewhere, but it will produce the exception when used with Load or LoadXml if the file is encoded differently than UTF8 without BOM.
public string GetXmlString(string url)
{
byte[] bytes = GetResponseByteArray(url);
return System.Text.Encoding.UTF8.GetString(bytes); // potentially exception on `LoadXml`
}

I've solved this issue by working directly on the byte array.
Get the UTF8 preamble and skip it if the data starts with it.
Afterwards you can transform the byte[] to a string with the GetString method, see below.
I've removed the \r and \t characters as well, just as a precaution.
XmlDocument configurationXML = new XmlDocument();
byte[] data = webRequest.downloadHandler.data;
byte[] preamble = Encoding.UTF8.GetPreamble();
// Skip the preamble only if the data really starts with it (Take/SequenceEqual need System.Linq).
bool hasBom = data.Length >= preamble.Length && data.Take(preamble.Length).SequenceEqual(preamble);
int offset = hasBom ? preamble.Length : 0;
string xml = System.Text.Encoding.UTF8.GetString(data, offset, data.Length - offset);
xml = xml.Replace("\r", "");
xml = xml.Replace("\t", "");

If your XML is in a string, the following strips the XML declaration from it (note that it does not remove a BOM character that precedes the declaration):
xml = new Regex("\\<\\?xml.*\\?>").Replace(xml, "");

At first I had problems escaping the "&" character, then diacritics and special letters were shown as question marks, and I ended up with the issue the OP mentioned.
I looked at the answers and used @Ringo's suggestion to try the Load() method as an alternative. That made me realize that I can deal with my response in other ways, not just as a string.
Using System.IO.Stream instead of string solved all the issues for me.
var response = await this.httpClient.GetAsync(url);
var responseStream = await response.Content.ReadAsStreamAsync();
var xmlDocument = new XmlDocument();
xmlDocument.Load(responseStream);
The cool thing about Load() is that this method automatically detects the string format of the input XML (for example, UTF-8, ANSI, and so on). See more

I have found out one of the solutions.
For your code this could be as follows -
XmlDocument xml = new XmlDocument();
// assuming the file is in the current directory
// and is named loadData.xml
// (declared outside the try block so the catch block can use it)
string myString = "./loadData.xml";
try
{
xml.Load(myString);
}
catch (Exception ex)
{
System.IO.File.WriteAllText(@"C:\text.txt", myString + "\r\n\r\n" + ex.Message);
throw;
}

If you are using XDocument.Parse(@""), prefix the string literal with @ (a verbatim string).
That resolves the issue.

Using an XmlDataDocument object is much better than using an XDocument or XmlDocument object. XmlDataDocument works fine with UTF8 and it doesn't have problems with Byte Order Sequences. You can get the child nodes of each element using ChildNodes property.
Use a custom function such as the following one:
static public void ReadXmlDataDocument2(string xmlFilePath)
{
if (xmlFilePath != null)
{
if (File.Exists(xmlFilePath))
{
System.IO.FileStream fs = default(System.IO.FileStream);
try
{
fs = new System.IO.FileStream(xmlFilePath, System.IO.FileMode.Open, System.IO.FileAccess.Read);
System.Xml.XmlDataDocument k_XDoc = new System.Xml.XmlDataDocument();
k_XDoc.Load(fs);
fs.Close();
fs.Dispose();
fs = null;
XmlNodeList ndsRoot = k_XDoc.ChildNodes;
foreach (System.Xml.XmlNode xLog in ndsRoot)
{
foreach (System.Xml.XmlNode xLog2 in xLog.ChildNodes)
{
if (xLog2.Name == "ERRORs")
{
foreach (System.Xml.XmlNode xLog3 in xLog2.ChildNodes)
{
if (xLog3.Name == "ErrorCode")
{
// Do something
}
if (xLog3.Name == "Description")
{
// Do something
}
}
}
}
}
}
catch (Exception ex)
{
MessageBox.Show(ex.Message);
}
}
}
}

Why do DownloadStringAsync results have strange characters when XmlReader results don't?

I am using an RSS feed as a data source. I've changed how I retrieve the data to use the WebClient.DownloadStringAsync method; before, I was using the XmlReader.Create method. I'm then using the results as a data source to bind to a TextBlock in WPF.
Since I made the change, all the specially encoded characters appear as odd characters when I display my results. I would love some help making sure I keep the proper encoding so that my display values don't have the odd characters.
// WebClient code...
private void GetFeed(int i)
{
Uri _feedUri = new Uri(_feedList[i]);
webClient = new WebClient();
webClient.DownloadStringCompleted += new DownloadStringCompletedEventHandler(stringStringCompletedEvent);
webClient.DownloadStringAsync(_feedUri, _feedTokenList[i]);
}
private void stringStringCompletedEvent(object sender, DownloadStringCompletedEventArgs e)
{
if (e.Error == null)
{
string xmlString = e.Result;
XmlDocument doc = new XmlDocument();
doc.LoadXml(xmlString);
doc.Save(@"C:\CashierData\msnbc-top.xml");
}
}
Here is my previous code using XmlReader to download and parse the feed.
XmlReader reader = XmlReader.Create(feed, settings);

I suspect the problem is that the web server isn't specifying the content encoding correctly.
You could use DownloadDataAsync instead, and then use
doc.Load(new MemoryStream(data));
(where data is the byte array). Then the XML parser will get to auto-detect the XML encoding from the binary data, instead of trusting the web server to know the right encoding.
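A minimal sketch of that suggestion, with a hypothetical helper name (`FeedBytes.Parse`). It takes the downloaded bytes and lets `XmlDocument.Load` pick the encoding from the BOM or the `encoding="..."` declaration in the data itself.

```csharp
using System.IO;
using System.Xml;

public static class FeedBytes
{
    // Hypothetical helper: parses raw downloaded bytes so the XML parser,
    // not the HTTP layer, decides how to decode them.
    public static XmlDocument Parse(byte[] data)
    {
        var doc = new XmlDocument();
        using (var stream = new MemoryStream(data))
        {
            doc.Load(stream);
        }
        return doc;
    }
}
```

In a `DownloadDataCompleted` handler the byte array is available as `e.Result`, so the call would be `FeedBytes.Parse(e.Result)`.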

Convert XML to Plain Text

My goal is to build an engine that takes the latest HL7 3.0 CDA documents and make them backward compatible with HL7 2.5 which is a radically different beast.
The CDA document is an XML file which when paired with its matching XSL file renders a HTML document fit for display to the end user.
In HL7 2.5 I need to get the rendered text, devoid of any markup, and fold it into a text stream (or similar) that I can write out in 80 character lines to populate the HL7 2.5 message.
So far, I'm taking the approach of using XslCompiledTransform to transform my XML document using XSLT and produce a resultant HTML document.
My next step is to take that document (or perhaps something from a step before this) and render the HTML as text. I have searched for a while, but can't figure out how to accomplish this. I'm hoping it's something easy that I'm just overlooking, or that I just can't find the magical search terms. Can anyone offer some help?
FWIW, I've read the 5 or 10 other questions in SO which embrace or admonish using RegEx for this, and don't think that I want to go down that road. I need the rendered text.
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;
using System.Xml.XPath;
public class TransformXML
{
public static void Main(string[] args)
{
try
{
string sourceDoc = "C:\\CDA_Doc.xml";
string resultDoc = "C:\\Result.html";
string xsltDoc = "C:\\CDA.xsl";
XPathDocument myXPathDocument = new XPathDocument(sourceDoc);
XslCompiledTransform myXslTransform = new XslCompiledTransform();
XmlTextWriter writer = new XmlTextWriter(resultDoc, null);
myXslTransform.Load(xsltDoc);
myXslTransform.Transform(myXPathDocument, null, writer);
writer.Close();
StreamReader stream = new StreamReader (resultDoc);
}
catch (Exception e)
{
Console.WriteLine ("Exception: {0}", e.ToString());
}
}
}

Since you have the XML source, consider writing an XSL that will give you the output you want without the intermediate HTML step. It would be far more reliable than trying to transform the HTML.
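A rough sketch of what skipping the HTML step could look like. The stylesheet here is a placeholder, not a CDA stylesheet: it has no templates of its own, so the built-in XSLT rules simply copy every text node, and `method="text"` keeps markup out of the output. The class and method names are hypothetical.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class TextTransform
{
    // Placeholder stylesheet: no templates, text output only.
    private const string Stylesheet =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">" +
        "<xsl:output method=\"text\"/>" +
        "</xsl:stylesheet>";

    public static string ToPlainText(string sourceXml)
    {
        var transform = new XslCompiledTransform();
        using (var xsltReader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            transform.Load(xsltReader);
        }

        using (var input = XmlReader.Create(new StringReader(sourceXml)))
        using (var output = new StringWriter())
        {
            // Writing to a TextWriter yields the transformed text directly.
            transform.Transform(input, null, output);
            return output.ToString();
        }
    }
}
```

A real stylesheet for CDA would need its own templates to insert line breaks and to pick the sections of interest; the sketch only shows the plumbing.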

This will leave you with just the text:
class Program
{
static void Main(string[] args)
{
var blah = new System.IO.StringReader(sourceDoc);
var reader = System.Xml.XmlReader.Create(blah);
StringBuilder result = new StringBuilder();
while (reader.Read())
{
result.Append( reader.Value);
}
Console.WriteLine(result);
}
static string sourceDoc = "<html><body><p>this is a paragraph</p><p>another paragraph</p></body></html>";
}

Or you can use a regular expression:
public static string StripHtml(String htmlText)
{
// replace all tags with spaces...
htmlText = Regex.Replace(htmlText, @"<(.|\n)*?>", " ");
// .. then eliminate all double spaces
while (htmlText.Contains("  "))
{
htmlText = htmlText.Replace("  ", " ");
}
// clear out non-breaking spaces and & character code
htmlText = htmlText.Replace("&nbsp;", " ");
htmlText = htmlText.Replace("&amp;", "&");
return htmlText;
}

Can you use something like this which uses lynx and perl to render the html and then convert that to plain text?

This is a great use-case for XSL:FO and FOP. FOP isn't just for PDF output, one of the other major outputs that is supported is text. You should be able to construct a simple xslt + fo stylesheet that has the specifications (i.e. line width) that you want.
This solution is a bit more heavyweight than just using xml->xslt->text as ScottSEA suggested, but if you have any more complex formatting requirements (e.g. indenting), it will become much easier to express in fo than to mock up in xslt.
I would avoid regexes for extracting the text. That's too low-level and guaranteed to be brittle. If you just want text and 80 character lines, the default xslt template will only print element text. Once you have only the text, you can apply whatever text processing is necessary.
Incidentally, I work for a company that produces CDAs as part of our product (voice recognition for dictations). I would look into an XSLT that transforms the 3.0 directly into 2.5. Depending on the fidelity you want to keep between the two versions, the full XSLT route will probably be your easiest bet if what you really want to achieve is conversion between the formats. That's what XSLT was built to do.

An error occurred while parsing EntityName

I'm trying to load a xml document into an object XPathDocument in C#.
My xml documents include this line:
trés dégagée + rade
and when the parser arrives there it gives me this error:
"An error occurred while parsing EntityName"
I know that's normal because of the character "é". Does anybody know how I can avoid this error? My idea is to insert an entity declaration into the XML document and then replace all special characters with entities, but that takes long and I'm not sure it works. Do you have other, simpler ideas?
Thanks a lot

Was about to post this and just then the servers went down. I think I've rewritten it correctly from memory:
I think that the problem lies within the fact that by default the XPathDocument uses an XmlTextReader to parse the contents of the supplied file and this XmlTextReader uses an EntityHandling setting of ExpandEntities.
In other words, when you rely on the default settings, an XmlTextReader will validate the input XML and try to resolve all entities. The better way is to do this manually by taking full control over the XmlReaderSettings (I always do it manually):
string myXMLFile = "SomeFile.xml";
string fileContent = LoadXML(myXMLFile);
private string LoadXML(string xml)
{
XPathDocument xDoc;
XmlReaderSettings xrs = new XmlReaderSettings();
// The following line does the "magic".
xrs.CheckCharacters = false;
using (XmlReader xr = XmlReader.Create(xml, xrs))
{
xDoc = new XPathDocument(xr);
}
if (xDoc != null)
{
XPathNavigator xNav = xDoc.CreateNavigator();
return xNav.OuterXml;
}
else
// Unable to load file
return null;
}

Typically this is caused by a mismatch between the encoding used to read the file and the file's actual encoding.
At a guess I would say the file is UTF-8 encoded but you are reading it with a default encoding.
Try beefing up your question with more details to get a more definitive answer.

How to determine if XML is well formed?

I've got a large xml document in a string. What's the best way to determine if the xml is well formed?

Something like:
static void Main() {
Test("<abc><def/></abc>");
Test("<abc><def/><abc>");
}
static void Test(string xml) {
using (XmlReader xr = XmlReader.Create(
new StringReader(xml))) {
try {
while (xr.Read()) { }
Console.WriteLine("Pass");
} catch (Exception ex) {
Console.WriteLine("Fail: " + ex.Message);
}
}
}
If you need to check against an xsd, then use XmlReaderSettings.
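A sketch of the XSD variant, assuming the schema is available as a string; the class and method names are hypothetical. Validation problems arrive through `ValidationEventHandler`, while malformed XML still surfaces as an `XmlException`.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class XsdCheck
{
    // Returns true when xml is well formed and valid against the given schema text.
    public static bool IsValid(string xml, string xsd)
    {
        var settings = new XmlReaderSettings { ValidationType = ValidationType.Schema };
        using (var schemaReader = XmlReader.Create(new StringReader(xsd)))
        {
            // null means: use the targetNamespace declared in the schema itself.
            settings.Schemas.Add(null, schemaReader);
        }

        bool valid = true;
        // Validation errors are reported through this callback instead of exceptions.
        settings.ValidationEventHandler += (sender, e) => valid = false;

        try
        {
            using (var reader = XmlReader.Create(new StringReader(xml), settings))
            {
                while (reader.Read()) { }
            }
        }
        catch (XmlException)
        {
            // Not well formed at all.
            return false;
        }
        return valid;
    }
}
```

Calling `XsdCheck.IsValid(xml, xsd)` with the two test strings from the answer and a schema that declares `abc` containing `def` separates the well-formed document from the broken one.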

Simply run it through a parser. That will perform the appropriate checks (whether it parses ok).
If it's a large document (as indicated) then an event-based parser (e.g. SAX) will be appropriate since it won't store the document in memory.
It's often useful to have XML utilities around to check this sort of stuff. I use XMLStarlet, which is a command-line set of tools for XML checking/manipulation.

XmlReader seems a good choice as it should stream the data (not load the whole xml in one go)
http://msdn.microsoft.com/en-us/library/9d83k261.aspx

Try using an XmlReader with an XmlReaderSettings that has ConformanceLevel.Document set.
