Why do DownloadStringAsync results have strange characters when XmlReader results don't? - c#

I am using an RSS feed as a data source. I've changed how I retrieve the data to use the WebClient.DownloadStringAsync method; before, I was using the XmlReader.Create method. I then use the results as a data source to bind to a TextBlock in WPF.
Since I made the change, all the specially encoded characters appear as odd characters when I display my results. I would love some help making sure I keep the proper encoding so that my display values don't contain the odd characters.
// WebClient code...
private void GetFeed(int i)
{
    Uri _feedUri = new Uri(_feedList[i]);
    webClient = new WebClient();
    webClient.DownloadStringCompleted += new DownloadStringCompletedEventHandler(stringStringCompletedEvent);
    webClient.DownloadStringAsync(_feedUri, _feedTokenList[i]);
}
private void stringStringCompletedEvent(object sender, DownloadStringCompletedEventArgs e)
{
    if (e.Error == null)
    {
        string xmlString = e.Result;
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xmlString);
        doc.Save(@"C:\CashierData\msnbc-top.xml");
    }
}
Here is my previous code using XmlReader to download and parse the feed.
XmlReader reader = XmlReader.Create(feed, settings);

I suspect the problem is that the web server isn't specifying the content encoding correctly.
You could use DownloadDataAsync instead, and then use
doc.Load(new MemoryStream(data));
(where data is the byte array). Then the XML parser will get to auto-detect the XML encoding from the binary data, instead of trusting the web server to know the right encoding.
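As a rough sketch of that suggestion, the handler below swaps DownloadStringAsync for DownloadDataAsync and hands the raw bytes to XmlDocument.Load. The class name FeedLoader, the helper ParseFeedBytes, and the handler name OnDownloadDataCompleted are made up for illustration; the save path is the one from the question.

```csharp
using System;
using System.IO;
using System.Net;
using System.Xml;

public class FeedLoader
{
    // Hypothetical helper: parses raw feed bytes. XmlDocument.Load(Stream)
    // looks at the byte order mark or the encoding named in the XML
    // declaration to decide how to decode the bytes.
    public static XmlDocument ParseFeedBytes(byte[] data)
    {
        var doc = new XmlDocument();
        using (var stream = new MemoryStream(data))
        {
            doc.Load(stream);
        }
        return doc;
    }

    // Assumed replacement for the question's GetFeed method.
    public void GetFeed(Uri feedUri, object token)
    {
        var webClient = new WebClient();
        webClient.DownloadDataCompleted += OnDownloadDataCompleted;
        webClient.DownloadDataAsync(feedUri, token);
    }

    private void OnDownloadDataCompleted(object sender, DownloadDataCompletedEventArgs e)
    {
        if (e.Error != null)
        {
            return; // error handling is omitted in this sketch
        }

        // e.Result is the undecoded byte array sent by the server.
        XmlDocument doc = ParseFeedBytes(e.Result);
        doc.Save(@"C:\CashierData\msnbc-top.xml"); // path taken from the question
    }
}
```

Because no string is created before parsing, the Content-Type header sent by the server plays no part in how the characters are decoded.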

Related

Weird character encoded characters (’) appearing from a feed

I've got a question regarding an XML feed and XSL transformation I'm doing. In a few parts of the outputted feed on an HTML page, I get weird characters (such as ’) appearing on the page.
On another site (that I don't own) that's using the same feed, it isn't getting these characters.
Here's the code I'm using to grab and return the transformed content:
string xmlUrl = "http://feedurl.com/feed.xml";
string xmlData = new System.Net.WebClient().DownloadString(xmlUrl);
string xslUrl = "http://feedurl.com/transform.xsl";
XsltArgumentList xslArgs = new XsltArgumentList();
xslArgs.AddParam("type", "", "specifictype");
string resultText = Utils.XslTransform(xmlData, xslUrl, xslArgs);
return resultText;
And my Utils.XslTransform function looks like this:
static public string XslTransform(string data, string xslurl, XsltArgumentList xslArgs)
{
    TextReader textReader = new StringReader(data);
    XmlReaderSettings settings = new XmlReaderSettings();
    settings.DtdProcessing = DtdProcessing.Ignore;
    XmlReader xmlReader = XmlReader.Create(textReader, settings);
    XmlReader xslReader = new XmlTextReader(Uri.UnescapeDataString(xslurl));
    XslCompiledTransform myXslT = new XslCompiledTransform();
    myXslT.Load(xslReader);
    StringBuilder sb = new StringBuilder();
    using (TextWriter tw = new StringWriter(sb))
    {
        // Pass the caller's arguments through to the transform.
        myXslT.Transform(xmlReader, xslArgs, tw);
    }
    string transformedData = sb.ToString();
    return transformedData;
}
I'm not extremely knowledgeable with character encoding issues and I've been trying to nip this in the bud for a bit of time and could use any suggestions possible. I'm not sure if there's something I need to change with how the WebClient downloads the file or something going weird in the XslTransform.
Thanks!
Give HtmlEncode a try. So in this case you would reference System.Web and then make this change (just call the HtmlEncode function on the last line):
string xmlUrl = "http://feedurl.com/feed.xml";
string xmlData = new System.Net.WebClient().DownloadString(xmlUrl);
string xslUrl = "http://feedurl.com/transform.xsl";
XsltArgumentList xslArgs = new XsltArgumentList();
xslArgs.AddParam("type", "", "specifictype");
string resultText = Utils.XslTransform(xmlData, xslUrl, xslArgs);
return HttpUtility.HtmlEncode(resultText);
The character â is what the first byte of a multibyte UTF-8 sequence looks like when UTF-8-encoded text is decoded with a single-byte encoding such as Windows-1252; the three bytes of the right single quotation mark come out as ’. So my guess is that you generate an HTML file in UTF-8 while the browser interprets it as something else. I see two ways to fix it:
The simplest solution would be to update the XSLT to include the HTML meta tag that hints the correct encoding to the browser: <meta charset="UTF-8">.
If your transform already defines a different encoding in a meta tag and you'd like to keep it, that encoding needs to be specified in the function that saves the XML to a file. I assume this function used ASCII by default in your example. If your XSLT was configured to generate XML files directly to disk, you could adjust it with the XSLT instruction <xsl:output encoding="ASCII"/>.
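To make the mismatch concrete, here is a small sketch that encodes the right single quotation mark (U+2019) as UTF-8 and then decodes the same bytes two ways. The class and method names are invented for this example. ISO-8859-1 is used as the single-byte encoding because it is available on every .NET runtime; under Windows-1252 the second and third bytes would display as € and ™, which gives the ’ seen in the question.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Encodes the text as UTF-8, then decodes those bytes as ISO-8859-1,
    // which maps every byte to exactly one character.
    public static string DecodeUtf8BytesAsLatin1(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding("iso-8859-1").GetString(utf8Bytes);
    }

    // Encodes and decodes with the same encoding, so the text survives.
    public static string DecodeUtf8BytesAsUtf8(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.UTF8.GetString(utf8Bytes);
    }
}
```

Calling MojibakeDemo.DecodeUtf8BytesAsLatin1("\u2019") yields three characters, the first of which is â (U+00E2), while DecodeUtf8BytesAsUtf8 returns the original single character.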
To use WebClient.DownloadString you have to know what encoding the server is going to use and tell the WebClient in advance. It's a bit of a Catch-22.
But there is no need to do that. Use WebClient.DownloadData or WebClient.OpenRead and let an XML library figure out which encoding to use.
using (var web = new WebClient())
using (var stream = web.OpenRead("http://unicode.org/repos/cldr/trunk/common/supplemental/windowsZones.xml"))
using (var reader = XmlReader.Create(stream, new XmlReaderSettings { DtdProcessing = DtdProcessing.Parse }))
{
    reader.MoveToContent();
    //… use reader as you will, including var doc = XDocument.ReadFrom(reader);
}

How to properly get the content of a website?

I'm trying to read the content of a page and extract some information. But sometimes I get text like: &nbsp;Aur&eacute;lie (Verschuere)
I already do this:
string siteContent = "";
using (System.Net.WebClient client = new System.Net.WebClient())
{
    client.Encoding = System.Text.Encoding.UTF8;
    siteContent = client.DownloadString(edtReadFromUrl.Text);
}
It works when the page contains UTF-8 characters. Is there a way to get readable text with no HTML in it? That would be even easier.
Edit: It's not the same as someone marked it. It does return strange characters with the other solution too.
You could use an html parser to extract meaning. For instance, with HtmlAgilityPack, you could:
HtmlDocument doc = new HtmlDocument();
string html;
using (var wc = new WebClient())
{
    html = wc.DownloadString("http://www.bbc.co.uk/news");
}
doc.LoadHtml(html);
string text = doc.DocumentNode.Element("html").Element("body").InnerText;
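The fragments in the question (&nbsp; and &eacute;) are HTML character references rather than a byte-encoding problem, and text extracted from markup may still contain them. A minimal sketch of turning them into readable characters with the framework's own WebUtility.HtmlDecode follows; the wrapper class EntityDemo is hypothetical and exists only to keep the example self-contained.

```csharp
using System;
using System.Net;

public static class EntityDemo
{
    // Replaces HTML character references such as &eacute; or &nbsp;
    // with the characters they stand for.
    public static string DecodeEntities(string htmlText)
    {
        return WebUtility.HtmlDecode(htmlText);
    }
}
```

For the sample from the question, EntityDemo.DecodeEntities("&nbsp;Aur&eacute;lie (Verschuere)") gives a non-breaking space followed by "Aurélie (Verschuere)".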

How to get unicode string with WebClient DownloadData?

Sorry for my bad English.
I am trying to get a string data with this code:
WebClient wc = new WebClient();
byte[] buffer = wc.DownloadData("http://......");
string xml = Encoding.UTF8.GetString(buffer);
XmlDocument doc = new XmlDocument();
doc.LoadXml(xml);
The string contains Unicode data. When I open the URL in a browser such as Firefox, everything is fine.
But in my code the string is broken and the XML file is useless. Some characters are changed to their decimal values, and those are the only ones that can still be read when the XML file is loaded; others are changed to strange symbols.
Do you know what I can do?
Put your data into a stream:
var stream = new MemoryStream(buffer);
And load it with the Load method:
doc.Load(stream);
This will try to detect the correct encoding.
Or maybe WebClient.DownloadString will work as well.

How to change encoding of read RSS XML in C#

I would like to create an RSS reader for my favourite website, but the problem is that its RSS XML contains a first line that corrupts the whole feed when parsing.
I get this error:
System does not support 'ISO-8859-2' encoding. Line 1, position 31.
Code:
void wc_OpenReadCompleted(object sender, OpenReadCompletedEventArgs e)
{
    SyndicationFeed feed;
    try {
        using (XmlReader reader = XmlReader.Create(e.Result)) {
            // I WOULD LIKE to delete some rows from the Result
            feed = SyndicationFeed.Load(reader);
            lista.ItemsSource = feed.Items;
        }
    } catch (WebException we) {
        MessageBox.Show("The internet connection is down.");
    }
}
Apparently the .NET framework used in WP7 doesn't support other encodings than UTF-8 and ISO-8859-1. What you could do, is to generate your own encoding implementation using this tool.
Then you read the stream with a detour via a StreamReader using the custom encoding:
using ( StreamReader sReader = new StreamReader(e.Result, new CustomEncoding()) )
using ( XmlReader xReader = XmlReader.Create(sReader) )
{
    //...
}
You could try to re-encode the data that is in e.Result, perhaps by using the Encoding.Convert method in .NET. But this will probably not be enough, since I assume that there is an encoding="ISO-8859-2" attribute in the XML. So you will probably also need to use String.Replace to swap that attribute for something else.
Or just try replacing the attribute on its own and see if that works. Do an e.Result.Replace("ISO-8859-2", "UTF-8") (on the text read from e.Result, which is a stream in the question's handler) and see what happens. If that doesn't work, try the first option of converting the string's encoding to another one and then doing the replace.
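The convert-then-replace idea could look roughly like the sketch below. It decodes the raw bytes, rewrites the encoding attribute in the XML declaration, and writes the text back out as UTF-8. FeedReencoder and ReencodeAsUtf8 are hypothetical names, and the caller has to supply an Encoding object that can actually decode the feed; on a platform without ISO-8859-2 support that would need to be a custom implementation.

```csharp
using System;
using System.Text;

public static class FeedReencoder
{
    // Decodes raw feed bytes with sourceEncoding, swaps the declared
    // encoding name for UTF-8, and returns the text as UTF-8 bytes.
    // Assumes the declaration uses double quotes around the name; note
    // that Replace changes every occurrence, not only the first.
    public static byte[] ReencodeAsUtf8(byte[] raw, Encoding sourceEncoding, string declaredName)
    {
        string text = sourceEncoding.GetString(raw);
        string patched = text.Replace(
            "encoding=\"" + declaredName + "\"",
            "encoding=\"UTF-8\"");

        // UTF8Encoding(false) writes no byte order mark.
        return new UTF8Encoding(false).GetBytes(patched);
    }
}
```

The returned bytes can be wrapped in a MemoryStream and passed to XmlReader.Create, so the parser sees a declaration that matches the actual byte encoding.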

upload XML -> read unicode stream and convert it

I have a FileUpload control where I can upload XML documents.
The XML files will be encoded in Unicode format. I want to convert them to UTF-8 so they render as proper XML files.
I'm saving the uploaded file in a hidden field as a hex string and sending it to a generic handler. What I want is a result that I can create an XML document from. At the moment my string looks like this:
"??<\0?\0x\0m\0l\0 \0v\0e\0r\0s\0i\0o\0n\0=\0\"\01\0.\00\0\"\0 \0e\0n\0c\0o\0d\0i\0n\0g\0=\0\"\0I\0S\0O\0-
Instead of
<?xml version="1.0".. etc
Code:
if (fileUpload.PostedFile.ContentType == "text/xml")
{
    Stream inputstream = fileUpload.PostedFile.InputStream;
    byte[] streamAsBytes = (ConvertStreamToByteArray(inputstream));
    string stringToSend = BitConverter.ToString(streamAsBytes);
    xmlstream.Value = stringToSend;
    sendXML.Visible = true;
    infoLabel.Text = "<b>Selected XML: </b>" + fileUpload.PostedFile.FileName;
}
handler.ashx:
if (HttpContext.Current.Request.Form["xmldata"] != null)
{
    HttpContext.Current.Response.ContentType = "text/xml";
    HttpContext.Current.Response.ContentEncoding = Encoding.UTF8;
    string xmlstring = HttpContext.Current.Request.Form["xmldata"];
    byte[] data = xmlstring.Split('-').Select(b => Convert.ToByte(b, 16)).ToArray();
    string complete = System.Text.ASCIIEncoding.ASCII.GetString(data);
    XmlDocument doc = new XmlDocument();
    doc.LoadXml(complete);
    HttpContext.Current.Response.Write(doc.InnerXml);
}
Thanks!
It's not at all clear that you really should do this. XML files can declare their own encoding, and it looks like yours is declaring an encoding starting with "ISO" (that's where the data you've given us stops). That's probably not UTF-8.
Basically, I don't think you should be treating the data as text in handler.ashx. Just get XmlDocument to parse it from a stream. It's not really clear exactly how your upload code is sending the data, but you should try to mess with it as little as possible.
It's possible that your current code would actually work fine if you just changed this:
string complete = System.Text.ASCIIEncoding.ASCII.GetString(data);
XmlDocument doc = new XmlDocument();
doc.LoadXml(complete);
to this:
XmlDocument doc = new XmlDocument();
doc.Load(new MemoryStream(data));
However, the hex part is pretty ugly. If you really need to represent the binary data as text, I'd strongly recommend using Base64 instead of hex:
string text = Convert.ToBase64String(binary);
...
byte[] binary = Convert.FromBase64String(text);
... there's no need to convert each byte separately and split the string on hyphens etc.
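Putting the two suggestions together, a self-contained sketch might carry the uploaded bytes as Base64 text and parse them from a stream on the receiving side. The class UploadRoundTrip and its method names are invented for illustration, and SampleUtf16Xml only fabricates a small UTF-16 document with a byte order mark to stand in for an uploaded file.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml;

public static class UploadRoundTrip
{
    // Upload side: turn the raw file bytes into text for a form field.
    public static string ToFormField(byte[] uploadedBytes)
    {
        return Convert.ToBase64String(uploadedBytes);
    }

    // Handler side: recover the bytes and let the XML parser detect the
    // encoding from the byte order mark or the XML declaration.
    public static XmlDocument FromFormField(string fieldValue)
    {
        byte[] data = Convert.FromBase64String(fieldValue);
        var doc = new XmlDocument();
        using (var stream = new MemoryStream(data))
        {
            doc.Load(stream);
        }
        return doc;
    }

    // Stand-in for an uploaded file: UTF-16 (little endian) with a BOM.
    public static byte[] SampleUtf16Xml()
    {
        Encoding utf16 = Encoding.Unicode;
        string xml = "<?xml version=\"1.0\" encoding=\"utf-16\"?><note>\u017Elu\u0165ou\u010Dk\u00FD</note>";
        return utf16.GetPreamble().Concat(utf16.GetBytes(xml)).ToArray();
    }
}
```

At no point is the XML decoded with a guessed encoding, so non-ASCII characters in the document survive the trip through the form field.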
