upload XML -> read unicode stream and convert it - c#

I have a fileupload control where i can upload xml documents.
The XML files will be encoded in unicode format. I want to convert them to UTF8, so they can render as a proper xml file.
Im saving the uploaded file in a hiddenfield as a hex string and sends it to a generic handler. What i want is a result that i can create an xml from. At the moment my string looks like this:
"??<\0?\0x\0m\0l\0 \0v\0e\0r\0s\0i\0o\0n\0=\0\"\01\0.\00\0\"\0 \0e\0n\0c\0o\0d\0i\0n\0g\0=\0\"\0I\0S\0O\0-
Instead of
<?xml version="1.0".. etc
Code:
if (fileUpload.PostedFile.ContentType == "text/xml")
{
Stream inputstream = fileUpload.PostedFile.InputStream;
byte[] streamAsBytes = (ConvertStreamToByteArray(inputstream));
string stringToSend = BitConverter.ToString(streamAsBytes);
xmlstream.Value = stringToSend;
sendXML.Visible = true;
infoLabel.Text = "<b>Selected XML: </b>" + fileUpload.PostedFile.FileName;
}
handler.ashx:
if (HttpContext.Current.Request.Form["xmldata"] != null)
{
HttpContext.Current.Response.ContentType = "text/xml";
HttpContext.Current.Response.ContentEncoding = Encoding.UTF8;
string xmlstring = HttpContext.Current.Request.Form["xmldata"];
byte[] data = xmlstring.Split('-').Select(b => Convert.ToByte(b, 16)).ToArray();
string complete = System.Text.ASCIIEncoding.ASCII.GetString(data);
XmlDocument doc = new XmlDocument();
doc.LoadXml(complete);
HttpContext.Current.Response.Write(doc.InnerXml);
}
Thanks!

It's not at all clear that you really should do this. XML files can declare their own encoding, and it looks like yours is declaring an encoding starting with "ISO" (that's where the data you've given us stops). That's probably not UTF-8.
Basically, I don't think you should be treating the data as text in handler.ashx. Just get XmlDocument to parse it from a stream. It's not really clear exactly how your upload code is sending the data, but you should try to mess with it as little as possible.
It's possible that your current code would actually work fine if you just changed this:
string complete = System.Text.ASCIIEncoding.ASCII.GetString(data);
XmlDocument doc = new XmlDocument();
doc.LoadXml(complete);
to this:
XmlDocument doc = new XmlDocument();
doc.Load(new MemoryStream(data));
However, the hex part is pretty ugly. If you really need to represent the binary data as text, I'd strongly recommend using Base64 instead of hex:
string text = Convert.ToBase64String(binary);
...
byte[] binary = Convert.FromBase64String(text);
... there's no need to convert each byte separately and split the string on hyphens etc.

Related

Weird character encoded characters (’) appearing from a feed

I've got a question regarding an XML feed and XSL transformation I'm doing. In a few parts of the outputted feed on an HTML page, I get weird characters (such as ’) appearing on the page.
On another site (that I don't own) that's using the same feed, it isn't getting these characters.
Here's the code I'm using to grab and return the transformed content:
string xmlUrl = "http://feedurl.com/feed.xml";
string xmlData = new System.Net.WebClient().DownloadString(xmlUrl);
string xslUrl = "http://feedurl.com/transform.xsl";
XsltArgumentList xslArgs = new XsltArgumentList();
xslArgs.AddParam("type", "", "specifictype");
string resultText = Utils.XslTransform(xmlData, xslUrl, xslArgs);
return resultText;
And my Utils.XslTransform function looks like this:
static public string XslTransform(string data, string xslurl)
{
TextReader textReader = new StringReader(data);
XmlReaderSettings settings = new XmlReaderSettings();
settings.DtdProcessing = DtdProcessing.Ignore;
XmlReader xmlReader = XmlReader.Create(textReader, settings);
XmlReader xslReader = new XmlTextReader(Uri.UnescapeDataString(xslurl));
XslCompiledTransform myXslT = new XslCompiledTransform();
myXslT.Load(xslReader);
StringBuilder sb = new StringBuilder();
using (TextWriter tw = new StringWriter(sb))
{
myXslT.Transform(xmlReader, new XsltArgumentList(), tw);
}
string transformedData = sb.ToString();
return transformedData;
}
I'm not extremely knowledgeable with character encoding issues and I've been trying to nip this in the bud for a bit of time and could use any suggestions possible. I'm not sure if there's something I need to change with how the WebClient downloads the file or something going weird in the XslTransform.
Thanks!
Give HtmlEncode a try. So in this case you would reference System.Web and then make this change (just call the HtmlEncode function on the last line):
string xmlUrl = "http://feedurl.com/feed.xml";
string xmlData = new System.Net.WebClient().DownloadString(xmlUrl);
string xslUrl = "http://feedurl.com/transform.xsl";
XsltArgumentList xslArgs = new XsltArgumentList();
xslArgs.AddParam("type", "", "specifictype");
string resultText = Utils.XslTransform(xmlData, xslUrl, xslArgs);
return HttpUtility.HtmlEncode(resultText);
The character â is a marker of multibyte sequence (’) of UTF-8-encoded text when it's represented as ASCII. So, I guess, you generate an HTML file in UTF-8, while browser interprets it otherwise. I see 2 ways to fix it:
The simplest solution would be to update the XSLT to include the HTML meta tag that will hint the correct encoding to browser: <meta charset="UTF-8">.
If your transform already defines a different encoding in meta tag and you'd like to keep it, this encoding needs to be specified in the function that saves XML as file. I assume this function took ASCII by default in your example. If your XSLT was configured to generate XML files directly to disk, you could adjust it with XSLT instruction <xsl:output encoding="ASCII"/>.
To use WebClient.DownloadString you have to know what the encoding the server is going use and tell the WebClient in advance. It's a bit of a Catch-22.
But, there is no need to do that. Use WebClient.DownloadData or WebClient.OpenReader and let an XML library figure out which encoding to use.
using (var web = new WebClient())
using (var stream = web.OpenRead("http://unicode.org/repos/cldr/trunk/common/supplemental/windowsZones.xml"))
using (var reader = XmlReader.Create(stream, new XmlReaderSettings { DtdProcessing = DtdProcessing.Parse }))
{
reader.MoveToContent();
//… use reader as you will, including var doc = XDocument.ReadFrom(reader);
}

Encode XDocumnet form win-1251 to utf-8

I try to convert XDocument from win-1 to utf-8. But in raw-view russian characters have bad view.
var encoding = new UTF8Encoding(false,false);
XmlTextWriter xmlTextWriter = new XmlTextWriter("F:\\File", Encoding.GetEncoding("windows-1251"));
document.Save(xmlTextWriter);
xmlTextWriter.Close();
xmlTextWriter = null;
string text = File.ReadAllText("F:\\File", Encoding.Default);
XDocument documentcode = XDocument.Parse(text);
xmlTextWriter = new XmlTextWriter(_Stream, encoding);
documentcode.Save(xmlTextWriter);
xmlTextWriter.Flush();
_Stream.Position = 0;
Headers.ContentType = new MediaTypeHeaderValue("application/xml");
This is the raw-view in SOAPUI
<?xml version="1.0" encoding="utf-8"?><StatObservationList><StatObservation><ObjectID>0b575ec1-7dea-41c4-a1f0-287190715ed2</ObjectID><Name>Тестовое статнаблюдение</Name><Code>GPPCode42</Code></StatObservation><StatObservation><ObjectID>3a871ea1-06ee-4991-a263-d643b424bdd4</ObjectID><Name>МиСП</Name><Code /></StatObservation></StatObservationList>
I think I've got it now. The text in your XDocument has, for whatever reason, been decoded incorrectly using Windows-1251.
Ideally, you need to go back to the source and ensure it is decoded properly (with UTF8). Converting this may not be an entirely loss-free process, as there are code points in the UTF8 that don't have a representation in Windows-1251 (a quick glance at the code page shows nothing for 0x98, for example).
However, to convert this after the fact the simplest way is just to get the text back, get the bytes for the encoding it was decoded with and then decode those with the correct encoding:
var windows1251 = Encoding.GetEncoding("windows-1251");
var utf8 = Encoding.UTF8;
var originalBytes = windows1251.GetBytes(document.ToString());
var correctXmlString = utf8.GetString(originalBytes);
var correctDocument = XDocument.Parse(correctXmlString);

How to get unicode string with WebClient DownloadData?

Sorry for my bad English.
I am trying to get a string data with this code:
WebClient wc = new WebClient();
byte[] buffer = wc.DownloadData("http://......);
string xml = Encoding.UTF8.GetString(buffer);
XmlDocument doc = new XmlDocument();
doc.LoadXml(xml);
the string has Unicode data. when I get this with my browser like firefox every things are ok.
But in my code the string is broken and xml file is useless. Some characters changed to their
decimal value and when reading xml file they are only characters that we can read. and others
changed to strange signs.
Do you know how can I do?
Put your data into a stream:
var stream = new MemoryStream(buffer);
And load it with the Load method:
doc.Load(stream);
This will try to detect the correct encoding.
Or maybe WebClient.DownloadString will work as well.

Convert utf-8 XML document to utf-16 for inserting into SQL

I have an XML document that has been created using utf-8 encoding. I want to store that document in a sql 2008 xml column but I understand I need to convert it to utf-16 in order to do that.
I've tried using XDocument to do this but I'm not getting a valid XML result after the conversion. Here is what I've tried to do the conversion on (Utf8StringWriter is a small class that inherits from StringWriter and overloads Encoding):
XDocument xDoc = XDocument.Parse(utf8Xml);
StringWriter writer = new StringWriter();
XmlWriter xml = XmlWriter.Create(writer, new XmlWriterSettings()
{ Encoding = writer.Encoding, Indent = true });
xDoc.WriteTo(xml);
string utf16Xml = writer.ToString();
The data in the utf16Xml is invalid and when trying to insert into the database I get the error:
{"XML parsing: line 1, character 38, unable to switch the encoding"}
However the initial utf8Xml data is definitely valid and contains all the info I need.
UPDATE:
The initial XML is obtained by using XMLSerializer (with an Utf8StringWriter class) to create the xml string from an existing object model (engine). The code for this is:
public static void Serialise<T>(T engine, ref StringWriter writer)
{
XmlWriter xml = XmlWriter.Create(writer, new XmlWriterSettings() { Encoding = writer.Encoding });
XmlSerializer xs = new XmlSerializer(engine.GetType());
xs.Serialize(xml, engine);
}
I have to leave this like this as that code is out of my control to change.
Before I even send the utf16Xml string to the failing database call I can view it via the Visual Studio debugger and I notice that the entire string is not present and instead I get a string literal was not closed error on the XML viewer.
The error is on first line XDocument xDoc = XDocument.Parse(utf8Xml);. Most likely you converted utf8 stream into a string (utf8xml), but encoding specified in the string is still utf-8, so XML reader fails. If it is true than load XML directly from stream using Load instead of converting it to string first.
Set the encoding of the document to UTF-16 after you have parsed it from utf8xml
XDocument xDoc = XDocument.Parse(utf8Xml);
xDoc.Declaration.Encoding = "utf-16";
StringWriter writer = new StringWriter();
XmlWriter xml = XmlWriter.Create(writer, new XmlWriterSettings()
{ Encoding = writer.Encoding, Indent = true });
xDoc.WriteTo(xml);
string utf16Xml = writer.ToString();
Here's what I had to do to make it work. This just converts the XML to utf-16
string getUtf16Xml(System.Xml.XmlDocument xmlDoc)
{
System.Xml.Linq.XDocument xDoc = System.Xml.Linq.XDocument.Parse(xmlDoc.OuterXml);
xDoc.Declaration.Encoding = "utf-16";
return xDoc.ToString();
}
Then I can save the results to the DB.

XMLDocument, problem reading Node

I am doing the following:
System.Net.WebRequest myRequest = System.Net.WebRequest.Create("http://www.atlantawithkid.com/feed/");
System.Net.WebResponse myResponse = myRequest.GetResponse();
System.IO.Stream rssStream = myResponse.GetResponseStream();
System.Xml.XmlDocument rssDoc = new System.Xml.XmlDocument();
rssDoc.Load(rssStream);
System.Xml.XmlNodeList rssItems = rssDoc.SelectNodes("rss/channel/item");
System.Xml.XmlNode rssDetail;
// FEED DESCRIPTION
string sRssDescription;
rssDetail = rssItems.Item(0).SelectSingleNode("description");
if (rssDetail != null)
sRssDescription = rssDetail.InnerText;
But, when I read the "description" node and view the InnerText, or the InnerXML, the string is different than in the original XML document.
The string return has and ellipses and the data si truncated. However, in the original XML document there is data that I can see.
Is there a way to select this node without the data being altered?
Thanks for the help.
I suspect you're looking at the string in the debugger, and that may be truncating the data. (Or you're writing it into something else which truncates text.)
I very much doubt that this is an XmlDocument problem.
I suggest you log the InnerText somewhere that you know you'll be able to get full data out, so you can tell for sure.

Categories