Encode XDocument from win-1251 to utf-8 - c#

I am trying to convert an XDocument from windows-1251 to UTF-8, but in the raw view the Russian characters come out garbled.
var encoding = new UTF8Encoding(false,false);
XmlTextWriter xmlTextWriter = new XmlTextWriter("F:\\File", Encoding.GetEncoding("windows-1251"));
document.Save(xmlTextWriter);
xmlTextWriter.Close();
xmlTextWriter = null;
string text = File.ReadAllText("F:\\File", Encoding.Default);
XDocument documentcode = XDocument.Parse(text);
xmlTextWriter = new XmlTextWriter(_Stream, encoding);
documentcode.Save(xmlTextWriter);
xmlTextWriter.Flush();
_Stream.Position = 0;
Headers.ContentType = new MediaTypeHeaderValue("application/xml");
This is the raw view in SoapUI:
<?xml version="1.0" encoding="utf-8"?><StatObservationList><StatObservation><ObjectID>0b575ec1-7dea-41c4-a1f0-287190715ed2</ObjectID><Name>Тестовое статнаблюдение</Name><Code>GPPCode42</Code></StatObservation><StatObservation><ObjectID>3a871ea1-06ee-4991-a263-d643b424bdd4</ObjectID><Name>МиСП</Name><Code /></StatObservation></StatObservationList>

I think I've got it now. The text in your XDocument has, for whatever reason, been decoded incorrectly using Windows-1251.
Ideally, you need to go back to the source and ensure it is decoded properly (as UTF-8). Converting after the fact may not be entirely lossless, because some byte values that occur in UTF-8 data have no character assigned in Windows-1251 (a quick glance at the code page shows nothing for 0x98, for example).
However, to convert this after the fact, the simplest way is to get the text back, get the bytes using the encoding it was wrongly decoded with, and then decode those bytes with the correct encoding:
var windows1251 = Encoding.GetEncoding("windows-1251");
var utf8 = Encoding.UTF8;
var originalBytes = windows1251.GetBytes(document.ToString());
var correctXmlString = utf8.GetString(originalBytes);
var correctDocument = XDocument.Parse(correctXmlString);
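To see why the re-encode and decode step works, here is a minimal round-trip sketch. It assumes a runtime where CodePagesEncodingProvider is available (on .NET Core / .NET 5+ windows-1251 has to be registered before use); the sample string and variable names are illustrative and not taken from the question.

```csharp
using System;
using System.Text;

// Assumption: .NET Core / .NET 5+, where legacy code pages must be registered first.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

var windows1251 = Encoding.GetEncoding("windows-1251");
var utf8 = Encoding.UTF8;

string original = "\u0422\u0435\u0441\u0442"; // the Russian word "Тест"

// Simulate the bug: UTF-8 bytes wrongly decoded as Windows-1251.
string garbled = windows1251.GetString(utf8.GetBytes(original));

// Repair: recover the original bytes, then decode them as UTF-8.
string repaired = utf8.GetString(windows1251.GetBytes(garbled));

Console.WriteLine(repaired == original);
```

The repair only succeeds when every byte survived the wrong decoding, which is why fixing the source remains the better option.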

Related

Encoding xml as ISO-8859-1

I sent an xml file which I created while serializing an object and received a response that it is incorrect and not well-formed:
ï»¿<?xml version="1.0" encoding="utf-8"?>
Moreover, I am supposed to use ISO-8859-1.
I assume that I not only have to change <?xml version="1.0" encoding="ISO-8859-1"?>, but additionally I have to create the file during serialization from the code already with encoding ISO-8859-1. Correct?
I am doing it this way:
XmlSerializer ser = new XmlSerializer(obj.GetType());
var encoding = Encoding.GetEncoding("ISO-8859-1");
XmlWriterSettings xmlWriterSettings = new XmlWriterSettings
{
Indent = true,
OmitXmlDeclaration = false,
Encoding = encoding
};
XmlDocument xd = null;
using (MemoryStream memStm = new MemoryStream())
{
using (var xmlWriter = XmlWriter.Create(memStm, xmlWriterSettings))
{
ser.Serialize(xmlWriter, input);
}
memStm.Position = 0;
XmlReaderSettings settings = new XmlReaderSettings();
using (var xtr = XmlReader.Create(memStm, settings))
{
xd = new XmlDocument();
xd.Load(xtr);
}
}
byte[] file = encoding.GetBytes(xd.OuterXml);
I used a framework to find out what encoding my created files have and when I create them with ISO-8859-1 as above my encoding checker gives me ASCII, is that correct?
> I sent an xml file which I created while serializing an object and received a response that it is incorrect and not well-formed:
The ï»¿ characters in front of the declaration are a UTF-8 BOM (byte order mark) shown in a single-byte encoding such as ISO-8859-1. A BOM may legitimately appear at the start of a UTF-8 encoded file, so your XML is valid if it is read properly.
More information about BOM: http://www.unicode.org/faq/utf_bom.html#bom1
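As a small illustration of the point about the BOM, the sketch below prints the UTF-8 preamble bytes and shows what they look like when read as ISO-8859-1. The variable names are illustrative only.

```csharp
using System;
using System.Text;

// The UTF-8 preamble (BOM) is the three bytes EF BB BF.
byte[] bom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true).GetPreamble();
Console.WriteLine(BitConverter.ToString(bom));

// Read as ISO-8859-1, each byte becomes one character: U+00EF, U+00BB, U+00BF.
string misread = Encoding.GetEncoding("ISO-8859-1").GetString(bom);
Console.WriteLine(misread == "\u00EF\u00BB\u00BF");

// A UTF8Encoding created with 'false' has an empty preamble, so no BOM is written.
byte[] noBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false).GetPreamble();
Console.WriteLine(noBom.Length);
```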
> I assume that I not only have to change <?xml version="1.0" encoding="ISO-8859-1"?>, but additionally I have to create the file during serialization from the code already with encoding ISO-8859-1. Correct?
Correct.
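One possible way to do that, sketched under assumptions: serialize through an XmlWriter whose settings carry the ISO-8859-1 encoding and take the bytes straight from the stream, without the detour through XmlDocument. The string array payload is a hypothetical stand-in for the object from the question.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

var encoding = Encoding.GetEncoding("ISO-8859-1");
var settings = new XmlWriterSettings { Indent = true, Encoding = encoding };

// Hypothetical payload; any XmlSerializer-compatible object would do.
string[] payload = { "caf\u00E9" };
var serializer = new XmlSerializer(typeof(string[]));

byte[] file;
using (var memory = new MemoryStream())
{
    using (var writer = XmlWriter.Create(memory, settings))
    {
        serializer.Serialize(writer, payload);
    }
    // The stream already holds ISO-8859-1 bytes, XML declaration included.
    file = memory.ToArray();
}

Console.WriteLine(encoding.GetString(file));
```

Because the writer produces both the declaration and the bytes, the declared encoding and the actual encoding cannot drift apart.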
> I used a framework to find out what encoding my created files have and when I create them with ISO-8859-1 as above my encoding checker gives me ASCII, is that correct?
Yes. The encoding of a text file cannot be determined exactly; it can only be guessed by analysing the bytes. Many encodings, including ISO-8859-1 and UTF-8, encode the ASCII characters with the same byte values, so a file that contains only ASCII characters is reasonably reported as ASCII.
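A short sketch of why a checker cannot tell these encodings apart for ASCII-only content: the bytes are identical until a non-ASCII character appears. The sample strings are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text;

var latin1 = Encoding.GetEncoding("ISO-8859-1");

// ASCII-only text: ASCII, ISO-8859-1 and UTF-8 all yield the same bytes.
string plain = "<Name>Test</Name>";
bool sameForAscii =
    Encoding.ASCII.GetBytes(plain).SequenceEqual(latin1.GetBytes(plain)) &&
    Encoding.UTF8.GetBytes(plain).SequenceEqual(latin1.GetBytes(plain));

// One non-ASCII character (e-acute) is enough to make the encodings diverge.
string accented = "<Name>caf\u00E9</Name>";
bool sameForAccented =
    Encoding.UTF8.GetBytes(accented).SequenceEqual(latin1.GetBytes(accented));

Console.WriteLine($"{sameForAscii} {sameForAccented}");
```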

Weird character encoded characters (â€™) appearing from a feed

I've got a question regarding an XML feed and XSL transformation I'm doing. In a few parts of the outputted feed on an HTML page, I get weird characters (such as â€™) appearing on the page.
On another site (that I don't own) that's using the same feed, it isn't getting these characters.
Here's the code I'm using to grab and return the transformed content:
string xmlUrl = "http://feedurl.com/feed.xml";
string xmlData = new System.Net.WebClient().DownloadString(xmlUrl);
string xslUrl = "http://feedurl.com/transform.xsl";
XsltArgumentList xslArgs = new XsltArgumentList();
xslArgs.AddParam("type", "", "specifictype");
string resultText = Utils.XslTransform(xmlData, xslUrl, xslArgs);
return resultText;
And my Utils.XslTransform function looks like this:
static public string XslTransform(string data, string xslurl, XsltArgumentList xslArgs)
{
    TextReader textReader = new StringReader(data);
    XmlReaderSettings settings = new XmlReaderSettings();
    settings.DtdProcessing = DtdProcessing.Ignore;
    XmlReader xmlReader = XmlReader.Create(textReader, settings);
    XmlReader xslReader = new XmlTextReader(Uri.UnescapeDataString(xslurl));
    XslCompiledTransform myXslT = new XslCompiledTransform();
    myXslT.Load(xslReader);
    StringBuilder sb = new StringBuilder();
    using (TextWriter tw = new StringWriter(sb))
    {
        // Pass the caller's arguments through instead of an empty list.
        myXslT.Transform(xmlReader, xslArgs, tw);
    }
    string transformedData = sb.ToString();
    return transformedData;
}
I'm not extremely knowledgeable with character encoding issues and I've been trying to nip this in the bud for a bit of time and could use any suggestions possible. I'm not sure if there's something I need to change with how the WebClient downloads the file or something going weird in the XslTransform.
Thanks!
Give HtmlEncode a try. So in this case you would reference System.Web and then make this change (just call the HtmlEncode function on the last line):
string xmlUrl = "http://feedurl.com/feed.xml";
string xmlData = new System.Net.WebClient().DownloadString(xmlUrl);
string xslUrl = "http://feedurl.com/transform.xsl";
XsltArgumentList xslArgs = new XsltArgumentList();
xslArgs.AddParam("type", "", "specifictype");
string resultText = Utils.XslTransform(xmlData, xslUrl, xslArgs);
return HttpUtility.HtmlEncode(resultText);
The character â is typically the first character you see when a multibyte UTF-8 sequence (here ’, the bytes E2 80 99) is misread as a single-byte encoding such as Windows-1252. So, I guess, you generate an HTML file in UTF-8, while the browser interprets it otherwise. I see 2 ways to fix it:
The simplest solution would be to update the XSLT to include the HTML meta tag that will hint the correct encoding to browser: <meta charset="UTF-8">.
If your transform already defines a different encoding in meta tag and you'd like to keep it, this encoding needs to be specified in the function that saves XML as file. I assume this function took ASCII by default in your example. If your XSLT was configured to generate XML files directly to disk, you could adjust it with XSLT instruction <xsl:output encoding="ASCII"/>.
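A minimal sketch of how such a sequence arises, assuming the page is decoded as Windows-1252 (the answer only says the browser interprets it "otherwise", so the specific code page is an assumption). It also assumes CodePagesEncodingProvider is available, as on .NET Core / .NET 5+.

```csharp
using System;
using System.Text;

// Assumption: .NET Core / .NET 5+, where code page 1252 must be registered first.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
var windows1252 = Encoding.GetEncoding(1252);

string apostrophe = "\u2019"; // right single quotation mark
byte[] utf8Bytes = Encoding.UTF8.GetBytes(apostrophe);
Console.WriteLine(BitConverter.ToString(utf8Bytes));

// Each of the three bytes is shown as its own character: U+00E2, U+20AC, U+2122.
string misread = windows1252.GetString(utf8Bytes);
Console.WriteLine(misread == "\u00E2\u20AC\u2122");
```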
To use WebClient.DownloadString you have to know what encoding the server is going to use and tell the WebClient in advance. It's a bit of a Catch-22.
But, there is no need to do that. Use WebClient.DownloadData or WebClient.OpenReader and let an XML library figure out which encoding to use.
using (var web = new WebClient())
using (var stream = web.OpenRead("http://unicode.org/repos/cldr/trunk/common/supplemental/windowsZones.xml"))
using (var reader = XmlReader.Create(stream, new XmlReaderSettings { DtdProcessing = DtdProcessing.Parse }))
{
reader.MoveToContent();
//… use reader as you will, including var doc = XDocument.ReadFrom(reader);
}

How to get unicode string with WebClient DownloadData?

Sorry for my bad English.
I am trying to get a string data with this code:
WebClient wc = new WebClient();
byte[] buffer = wc.DownloadData("http://......");
string xml = Encoding.UTF8.GetString(buffer);
XmlDocument doc = new XmlDocument();
doc.LoadXml(xml);
The string contains Unicode data. When I request the same URL in a browser such as Firefox, everything is fine.
But in my code the string is broken and the XML is useless. Some characters are replaced by their decimal values, and those are the only ones that are still readable when the XML is read; the others are replaced by strange symbols.
Do you know what I can do?
Put your data into a stream:
var stream = new MemoryStream(buffer);
And load it with the Load method:
doc.Load(stream);
This will try to detect the correct encoding.
Or maybe WebClient.DownloadString will work as well.
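The difference between guessing the encoding and letting the parser detect it can be sketched without a network call. The byte array below is a hypothetical stand-in for the downloaded buffer and is encoded as ISO-8859-1 purely for illustration; XDocument.Load is used here, and XmlDocument.Load on a stream behaves the same way in this respect.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Linq;

// Hypothetical download: an ISO-8859-1 document that declares its own encoding.
var latin1 = Encoding.GetEncoding("ISO-8859-1");
byte[] buffer = latin1.GetBytes(
    "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><name>caf\u00E9</name>");

// Guessing UTF-8 turns the e-acute byte into the replacement character U+FFFD.
string guessed = Encoding.UTF8.GetString(buffer);
Console.WriteLine(guessed.Contains('\uFFFD'));

// Loading from a stream lets the parser honour the declared encoding.
XDocument doc;
using (var stream = new MemoryStream(buffer))
{
    doc = XDocument.Load(stream);
}
Console.WriteLine(doc.Root.Value == "caf\u00E9");
```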

upload XML -> read unicode stream and convert it

I have a file upload control where I can upload XML documents.
The XML files will be encoded in Unicode format. I want to convert them to UTF-8 so they can be rendered as proper XML.
I'm saving the uploaded file in a hidden field as a hex string and sending it to a generic handler. What I want is a result that I can create an XML document from. At the moment my string looks like this:
"??<\0?\0x\0m\0l\0 \0v\0e\0r\0s\0i\0o\0n\0=\0\"\01\0.\00\0\"\0 \0e\0n\0c\0o\0d\0i\0n\0g\0=\0\"\0I\0S\0O\0-
Instead of
<?xml version="1.0".. etc
Code:
if (fileUpload.PostedFile.ContentType == "text/xml")
{
Stream inputstream = fileUpload.PostedFile.InputStream;
byte[] streamAsBytes = (ConvertStreamToByteArray(inputstream));
string stringToSend = BitConverter.ToString(streamAsBytes);
xmlstream.Value = stringToSend;
sendXML.Visible = true;
infoLabel.Text = "<b>Selected XML: </b>" + fileUpload.PostedFile.FileName;
}
handler.ashx:
if (HttpContext.Current.Request.Form["xmldata"] != null)
{
HttpContext.Current.Response.ContentType = "text/xml";
HttpContext.Current.Response.ContentEncoding = Encoding.UTF8;
string xmlstring = HttpContext.Current.Request.Form["xmldata"];
byte[] data = xmlstring.Split('-').Select(b => Convert.ToByte(b, 16)).ToArray();
string complete = System.Text.ASCIIEncoding.ASCII.GetString(data);
XmlDocument doc = new XmlDocument();
doc.LoadXml(complete);
HttpContext.Current.Response.Write(doc.InnerXml);
}
Thanks!
It's not at all clear that you really should do this. XML files can declare their own encoding, and it looks like yours is declaring an encoding starting with "ISO" (that's where the data you've given us stops). That's probably not UTF-8.
Basically, I don't think you should be treating the data as text in handler.ashx. Just get XmlDocument to parse it from a stream. It's not really clear exactly how your upload code is sending the data, but you should try to mess with it as little as possible.
It's possible that your current code would actually work fine if you just changed this:
string complete = System.Text.ASCIIEncoding.ASCII.GetString(data);
XmlDocument doc = new XmlDocument();
doc.LoadXml(complete);
to this:
XmlDocument doc = new XmlDocument();
doc.Load(new MemoryStream(data));
However, the hex part is pretty ugly. If you really need to represent the binary data as text, I'd strongly recommend using Base64 instead of hex:
string text = Convert.ToBase64String(binary);
...
byte[] binary = Convert.FromBase64String(text);
... there's no need to convert each byte separately and split the string on hyphens etc.
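Putting both suggestions together, a rough end-to-end sketch might look like this. The uploaded bytes are simulated with a made-up UTF-16 document, and the "page side" and "handler side" comments only mark where each step would live; none of the names come from the question.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml;

// Hypothetical upload: a UTF-16 ("Unicode") XML file, BOM included.
string xml = "<?xml version=\"1.0\" encoding=\"utf-16\"?><root>h\u00E9llo</root>";
byte[] uploaded = Encoding.Unicode.GetPreamble()
    .Concat(Encoding.Unicode.GetBytes(xml))
    .ToArray();

// Page side: carry the bytes as text without interpreting them.
string text = Convert.ToBase64String(uploaded);

// Handler side: restore the bytes and let the parser detect the encoding.
byte[] binary = Convert.FromBase64String(text);
var doc = new XmlDocument();
using (var stream = new MemoryStream(binary))
{
    doc.Load(stream);
}

Console.WriteLine(doc.DocumentElement.InnerText);
```

At no point are the bytes decoded with a hand-picked encoding, which is what went wrong with ASCIIEncoding in the original handler.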

How to get Xml as string from XDocument?

I am new to LINQ to XML. After you have built XDocument, how do you get the OuterXml of it like you did with XmlDocument?
You only need to use the overridden ToString() method of the object:
XDocument xmlDoc ...
string xml = xmlDoc.ToString();
This works with all XObjects, like XElement, etc.
I don't know when this changed, but today (July 2017) when trying the answers out, I got
"System.Xml.XmlDocument"
Instead of ToString(), you can use the originally intended way of accessing the XmlDocument content: writing the XML document to a stream.
XmlDocument xml = ...;
string result;
using (StringWriter writer = new StringWriter())
{
xml.Save(writer);
result = writer.ToString();
}
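For comparison, this small sketch shows the behaviour described above for XmlDocument: ToString() yields only the type name, while OuterXml returns the markup. The sample document is made up.

```csharp
using System;
using System.Xml;

var xml = new XmlDocument();
xml.LoadXml("<root><leaf>data</leaf></root>");

// XmlDocument does not override ToString(), so this is just the type name.
string viaToString = xml.ToString();
Console.WriteLine(viaToString);

// OuterXml returns the markup of the document.
string viaOuterXml = xml.OuterXml;
Console.WriteLine(viaOuterXml);
```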
Several responses give a slightly incorrect answer.
XDocument.ToString() omits the XML declaration (and, according to @Alex Gordon, may return invalid XML if it contains encoded unusual characters like &).
Saving XDocument to StringWriter will cause .NET to emit encoding="utf-16", which you most likely don't want (if you save XML as a string, it's probably because you want to later save it as a file, and de facto standard for saving files is UTF-8 - .NET saves text files as UTF-8 unless specified otherwise).
@Wolfgang Grinfeld's answer is heading in the right direction, but it's unnecessarily complex.
Use the following:
var memory = new MemoryStream();
xDocument.Save(memory);
string xmlText = Encoding.UTF8.GetString(memory.ToArray());
This will return XML text with UTF-8 declaration.
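The three approaches discussed in this answer can be compared side by side. This is only a sketch with a made-up document; note that the stream variant, with default settings, also writes a UTF-8 BOM, which survives decoding as a leading U+FEFF character.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Linq;

var doc = new XDocument(new XElement("Root", new XElement("Leaf", "data")));

// 1. ToString(): no XML declaration at all.
string viaToString = doc.ToString();

// 2. StringWriter: the declaration says utf-16, matching .NET's string encoding.
string viaStringWriter;
using (var writer = new StringWriter())
{
    doc.Save(writer);
    viaStringWriter = writer.ToString();
}

// 3. MemoryStream: the declaration says utf-8, and the bytes start with a BOM.
string viaStream;
using (var memory = new MemoryStream())
{
    doc.Save(memory);
    viaStream = Encoding.UTF8.GetString(memory.ToArray());
}

Console.WriteLine(viaToString.StartsWith("<?xml", StringComparison.Ordinal));
Console.WriteLine(viaStringWriter.Contains("encoding=\"utf-16\""));
Console.WriteLine(viaStream.Contains("encoding=\"utf-8\""));
Console.WriteLine(viaStream[0] == '\uFEFF');
```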
Doing XDocument.ToString() may not get you the full XML.
In order to get the XML declaration at the start of the XML document as a string, use the XDocument.Save() method:
var ms = new MemoryStream();
using (var xw = XmlWriter.Create(new StreamWriter(ms, Encoding.GetEncoding("ISO-8859-1"))))
new XDocument(new XElement("Root", new XElement("Leaf", "data"))).Save(xw);
var myXml = Encoding.GetEncoding("ISO-8859-1").GetString(ms.ToArray());
Use ToString() to convert XDocument into a string:
string result = string.Empty;
XElement root = new XElement("xml",
new XElement("MsgType", "<![CDATA[" + "text" + "]]>"),
new XElement("Content", "<![CDATA[" + "Hi, this is Wilson Wu Testing for you! You can ask any question but no answer can be replied...." + "]]>"),
new XElement("FuncFlag", 0)
);
result = root.ToString();
While @wolfgang-grinfeld's answer is technically correct (it also produces the XML declaration, as opposed to just using the .ToString() method), the code generates a UTF-8 byte order mark (BOM), which the XDocument.Parse(string) method cannot process; it throws a "Data at the root level is invalid. Line 1, position 1." error.
So here is another solution without the BOM:
var utf8Encoding =
new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);
using (var memory = new MemoryStream())
using (var writer = XmlWriter.Create(memory, new XmlWriterSettings
{
OmitXmlDeclaration = false,
Encoding = utf8Encoding
}))
{
CompanyDataXml.Save(writer);
writer.Flush();
return utf8Encoding.GetString(memory.ToArray());
}
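If the string has already been produced with a BOM and the writing code cannot be changed, one alternative is to drop the leading U+FEFF before parsing. This is a sketch only; StripBom is a hypothetical helper, not a framework API, and the sample document is made up.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Linq;

var source = new XDocument(new XElement("Root", "data"));

// Saving to a stream with default settings emits a UTF-8 BOM.
string xmlText;
using (var memory = new MemoryStream())
{
    source.Save(memory);
    xmlText = Encoding.UTF8.GetString(memory.ToArray());
}

// Hypothetical helper (not a framework API): remove a leading U+FEFF, if any.
static string StripBom(string text) => text.TrimStart('\uFEFF');

var parsed = XDocument.Parse(StripBom(xmlText));
Console.WriteLine(parsed.Root.Value);
```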
I found this example in the Microsoft .NET 6 documentation for the XDocument.Save method. I think it answers the original question (what is the XDocument equivalent of XmlDocument.OuterXml), and also addresses the concerns that others have pointed out already. By using XmlWriterSettings you can predictably control the string output.
https://learn.microsoft.com/en-us/dotnet/api/system.xml.linq.xdocument.save
StringBuilder sb = new StringBuilder();
XmlWriterSettings xws = new XmlWriterSettings();
xws.OmitXmlDeclaration = true;
xws.Indent = true;
using (XmlWriter xw = XmlWriter.Create(sb, xws)) {
XDocument doc = new XDocument(
new XElement("Child",
new XElement("GrandChild", "some content")
)
);
doc.Save(xw);
}
Console.WriteLine(sb.ToString());
Looking at these answers, I see a lot of unnecessary complexity and inefficiency in pursuit of generating the XML declaration automatically. But since the declaration is so simple, there isn't much value in generating it. Just KISS (keep it simple, stupid):
// Extension method
public static string ToStringWithDeclaration(this XDocument doc, string declaration = null)
{
declaration ??= "<?xml version=\"1.0\" encoding=\"utf-8\"?>\r\n";
return declaration + doc.ToString();
}
// Usage
string xmlString = doc.ToStringWithDeclaration();
// Or
string xmlString = doc.ToStringWithDeclaration("...");
Using XmlWriter instead of ToString() can give you more control over how the output is formatted (such as if you want indentation), and it can write to other targets besides string.
The reason to target a memory stream is performance. It lets you skip the step of storing the XML in a string (since you know the data must end up in a different encoding eventually, whereas string is always UTF-16 in C#). For instance, for an HTTP request:
// Extension method
public static ByteArrayContent ToByteArrayContent(
this XDocument doc, XmlWriterSettings xmlWriterSettings = null)
{
xmlWriterSettings ??= new XmlWriterSettings();
using (var stream = new MemoryStream())
{
using (var writer = XmlWriter.Create(stream, xmlWriterSettings))
{
doc.Save(writer);
}
var content = new ByteArrayContent(stream.GetBuffer(), 0, (int)stream.Length);
content.Headers.ContentType = new MediaTypeHeaderValue("text/xml");
return content;
}
}
// Usage (XDocument -> UTF-8 bytes)
var content = doc.ToByteArrayContent();
var response = await httpClient.PostAsync("/someurl", content);
// Alternative (XDocument -> string -> UTF-8 bytes)
var content = new StringContent(doc.ToStringWithDeclaration(), Encoding.UTF8, "text/xml");
var response = await httpClient.PostAsync("/someurl", content);
