Weird encoded characters (’) appearing from a feed - C#

I've got a question regarding an XML feed and XSL transformation I'm doing. In a few parts of the outputted feed on an HTML page, I get weird characters (such as ’) appearing on the page.
On another site (that I don't own) that's using the same feed, it isn't getting these characters.
Here's the code I'm using to grab and return the transformed content:
string xmlUrl = "http://feedurl.com/feed.xml";
string xmlData = new System.Net.WebClient().DownloadString(xmlUrl);
string xslUrl = "http://feedurl.com/transform.xsl";
XsltArgumentList xslArgs = new XsltArgumentList();
xslArgs.AddParam("type", "", "specifictype");
string resultText = Utils.XslTransform(xmlData, xslUrl, xslArgs);
return resultText;
And my Utils.XslTransform function looks like this:
public static string XslTransform(string data, string xslUrl, XsltArgumentList xslArgs)
{
    TextReader textReader = new StringReader(data);
    XmlReaderSettings settings = new XmlReaderSettings();
    settings.DtdProcessing = DtdProcessing.Ignore;
    XmlReader xmlReader = XmlReader.Create(textReader, settings);
    XmlReader xslReader = new XmlTextReader(Uri.UnescapeDataString(xslUrl));
    XslCompiledTransform myXslT = new XslCompiledTransform();
    myXslT.Load(xslReader);
    StringBuilder sb = new StringBuilder();
    using (TextWriter tw = new StringWriter(sb))
    {
        // Pass the caller's arguments through instead of an empty list.
        myXslT.Transform(xmlReader, xslArgs, tw);
    }
    string transformedData = sb.ToString();
    return transformedData;
}
I'm not extremely knowledgeable with character encoding issues and I've been trying to nip this in the bud for a bit of time and could use any suggestions possible. I'm not sure if there's something I need to change with how the WebClient downloads the file or something going weird in the XslTransform.
Thanks!

Give HtmlEncode a try. So in this case you would reference System.Web and then make this change (just call the HtmlEncode function on the last line):
string xmlUrl = "http://feedurl.com/feed.xml";
string xmlData = new System.Net.WebClient().DownloadString(xmlUrl);
string xslUrl = "http://feedurl.com/transform.xsl";
XsltArgumentList xslArgs = new XsltArgumentList();
xslArgs.AddParam("type", "", "specifictype");
string resultText = Utils.XslTransform(xmlData, xslUrl, xslArgs);
return HttpUtility.HtmlEncode(resultText);

The character â marks the start of a multibyte UTF-8 sequence that has been decoded with the wrong encoding: ’ is what the three UTF-8 bytes of the right single quotation mark (E2 80 99) look like when they are read with a single-byte encoding such as Windows-1252. So, I guess, you generate the HTML in UTF-8 while the browser interprets it otherwise. I see two ways to fix it:
The simplest solution would be to update the XSLT to include the HTML meta tag that hints the correct encoding to the browser: <meta charset="UTF-8">.
If your transform already defines a different encoding in a meta tag and you'd like to keep it, that encoding needs to be specified in the function that saves the XML as a file. I assume this function used ASCII by default in your example. If your XSLT were configured to write files directly to disk, you could adjust it with the XSLT instruction <xsl:output encoding="ASCII"/>.
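To make the mechanism above concrete, here is a minimal sketch that encodes a string as UTF-8 and then decodes the bytes with the wrong encoding. The class and method names are made up for illustration. ISO-8859-1 stands in for the wrong encoding because it is available on every .NET runtime without extra setup; with Windows-1252 the second and third bytes would show up as € and ™ instead of invisible control characters.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Encodes text as UTF-8, then decodes those bytes with a different
    // encoding, which is what a consumer does when it guesses the charset wrong.
    public static string Misdecode(string text, Encoding wrongEncoding)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return wrongEncoding.GetString(utf8Bytes);
    }

    // Reverses the damage. This only works when the wrong encoding maps
    // every byte to a character and back, as ISO-8859-1 does.
    public static string Repair(string garbled, Encoding wrongEncoding)
    {
        byte[] originalBytes = wrongEncoding.GetBytes(garbled);
        return Encoding.UTF8.GetString(originalBytes);
    }
}
```

Repairing after the fact is fragile, so it is usually better to decode the feed correctly in the first place.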

To use WebClient.DownloadString you have to know which encoding the server is going to use and tell the WebClient in advance. It's a bit of a Catch-22.
But there is no need to do that. Use WebClient.DownloadData or WebClient.OpenRead and let an XML library figure out which encoding to use.
using (var web = new WebClient())
using (var stream = web.OpenRead("http://unicode.org/repos/cldr/trunk/common/supplemental/windowsZones.xml"))
using (var reader = XmlReader.Create(stream, new XmlReaderSettings { DtdProcessing = DtdProcessing.Parse }))
{
reader.MoveToContent();
//… use reader as you will, including var doc = XDocument.ReadFrom(reader);
}
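The difference between handing the parser raw bytes and handing it an already decoded string can be sketched without a network call. In the sketch below, the in-memory byte array stands in for the downloaded feed, and the class name, element name and sample text are all hypothetical.

```csharp
using System.IO;
using System.Text;
using System.Xml.Linq;

public static class EncodingDetectionDemo
{
    // Builds a small document whose bytes really are ISO-8859-1,
    // matching what its XML declaration says.
    public static byte[] BuildLatin1Xml()
    {
        string xml = "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?><t>caf\u00E9</t>";
        return Encoding.GetEncoding("iso-8859-1").GetBytes(xml);
    }

    // The XML reader sees the bytes and honours the declared encoding.
    public static string ReadViaStream(byte[] bytes)
    {
        using (var stream = new MemoryStream(bytes))
        {
            return XDocument.Load(stream).Root.Value;
        }
    }

    // Decoding the bytes with a guessed encoding first damages the text
    // before the XML reader ever gets a chance to look at the declaration.
    public static string ReadViaWrongString(byte[] bytes)
    {
        string text = Encoding.UTF8.GetString(bytes);
        return XDocument.Parse(text).Root.Value;
    }
}
```

The stream-based path returns the text intact, while the string-based path has already lost the accented character by the time parsing starts.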

Related

Hidden character in saved XML [duplicate]

I'm generating an utf-8 XML file using XDocument.
XDocument xml_document = new XDocument(
new XDeclaration("1.0", "utf-8", null),
new XElement(ROOT_NAME,
new XAttribute("note", note)
)
);
...
xml_document.Save(@file_path);
The file is generated correctly and validated with an xsd file with success.
When I try to upload the XML file to an online service, the service says that my file is wrong at line 1; I have discovered that the problem is caused by the BOM on the first bytes of the file.
Do you know why the BOM is appended to the file and how can I save the file without it?
As stated in Byte order mark Wikipedia article:
"While Unicode standard allows BOM in UTF-8 it does not require or recommend it. Byte order has no meaning in UTF-8 so a BOM only serves to identify a text stream or file as UTF-8 or that it was converted from another format that has a BOM"
Is it an XDocument problem or should I contact the guys of the online service provider to ask for a parser upgrade?
Use an XmlTextWriter and pass that to the XDocument's Save() method, that way you can have more control over the type of encoding used:
var doc = new XDocument(
new XDeclaration("1.0", "utf-8", null),
new XElement("root", new XAttribute("note", "boogers"))
);
using (var writer = new XmlTextWriter(".\\boogers.xml", new UTF8Encoding(false)))
{
doc.Save(writer);
}
The UTF8Encoding class constructor has an overload that specifies whether or not to use the BOM (Byte Order Mark) with a boolean value, in your case false.
The result of this code was verified using Notepad++ to inspect the file's encoding.
First of all: the service provider MUST handle it, according to XML spec, which states that BOM may be present in case of UTF-8 representation.
You can force to save your XML without BOM like this:
XmlWriterSettings settings = new XmlWriterSettings();
settings.Encoding = new UTF8Encoding(false); // The false means, do not emit the BOM.
using (XmlWriter w = XmlWriter.Create("my.xml", settings))
{
doc.Save(w);
}
(Googled from here: http://social.msdn.microsoft.com/Forums/en/xmlandnetfx/thread/ccc08c65-01d7-43c6-adf3-1fc70fdb026a)
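A quick way to confirm the effect of the UTF8Encoding constructor argument is to save to a MemoryStream and look at the first three bytes. This is only a sketch; the class and method names are made up for illustration.

```csharp
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

public static class BomCheck
{
    // Serializes the document as UTF-8, with or without a byte order mark.
    public static byte[] Save(XDocument doc, bool emitBom)
    {
        var settings = new XmlWriterSettings { Encoding = new UTF8Encoding(emitBom) };
        using (var memory = new MemoryStream())
        {
            using (var writer = XmlWriter.Create(memory, settings))
            {
                doc.Save(writer);
            }
            return memory.ToArray();
        }
    }

    // The UTF-8 BOM is the byte sequence EF BB BF.
    public static bool StartsWithUtf8Bom(byte[] bytes)
    {
        return bytes.Length >= 3
            && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF;
    }
}
```

Calling StartsWithUtf8Bom on the output of Save(doc, false) is a cheap check to run before uploading the file to a service that rejects the BOM.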
The most expedient way to get rid of the BOM when using XDocument is to save the document, read the file straight back in, and then write it back out. The File routines will strip the BOM for you:
XDocument xTasks = new XDocument();
XElement xRoot = new XElement("tasklist",
new XAttribute("timestamp",lastUpdated),
new XElement("lasttask",lastTask)
);
...
xTasks.Add(xRoot);
xTasks.Save("tasks.xml");
// read it straight in, write it straight back out. Done.
string[] lines = File.ReadAllLines("tasks.xml");
File.WriteAllLines("tasks.xml",lines);
(It's hokey, but it works for the sake of expediency. At least you'll have a well-formed file to upload to your online provider.) ;)
For UTF-8 documents:
String XMLDec = xDoc.Declaration.ToString();
StringBuilder sb = new StringBuilder(XMLDec);
sb.Append(xDoc.ToString());
Encoding encoding = new UTF8Encoding(false); // false = without BOM
File.WriteAllText(outPath, sb.ToString(), encoding);

Remove all hexadecimal characters before loading string into XML Document Object?

I have an xml string that is being posted to an ashx handler on the server. The xml string is built on the client-side and is based on a few different entries made on a form. Occasionally some users will copy and paste from other sources into the web form. When I try to load the xml string into an XMLDocument object using xmldoc.LoadXml(xmlStr) I get the following exception:
System.Xml.XmlException = {"'', hexadecimal value 0x0B, is an invalid character. Line 2, position 1."}
In debug mode I can see the rogue character (sorry, I'm not sure of its official title):
My question is: how can I sanitise the XML string before I attempt to load it into the XmlDocument object? Do I need a custom function to parse out all these sorts of characters one by one, or can I use some native .NET 4 class to remove them?
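Here is an example that removes invalid XML characters using Regex:
xmlString = CleanInvalidXmlChars(xmlString);
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.LoadXml(xmlString);

public static string CleanInvalidXmlChars(string text)
{
    // XML 1.0 allows #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF.
    // .NET's \x escape takes exactly two hex digits, so \u escapes are used here.
    // Characters above #xFFFF are kept as surrogate pairs; lone surrogates are removed.
    string re = @"[\x00-\x08\x0B\x0C\x0E-\x1F\uFFFE\uFFFF]"
              + @"|[\uD800-\uDBFF](?![\uDC00-\uDFFF])"
              + @"|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]";
    return Regex.Replace(text, re, "");
}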
A more efficient way to not error out on invalid XML characters would be to use the CheckCharacters flag in XmlReaderSettings.
var xmlDoc = new XmlDocument();
var xmlReaderSettings = new XmlReaderSettings { CheckCharacters = false };
using (var stringReader = new StringReader(xml)) {
using (var xmlReader = XmlReader.Create(stringReader, xmlReaderSettings)) {
xmlDoc.Load(xmlReader);
}
}
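On the question of a native .NET 4 class: XmlConvert.IsXmlChar can be used to filter the string without a regular expression. The sketch below is one way to do that, assuming invalid characters should simply be dropped; the class and method names are hypothetical.

```csharp
using System.Text;
using System.Xml;

public static class XmlSanitizer
{
    // Returns the input with every character that XML 1.0 does not allow removed.
    public static string StripInvalidXmlChars(string text)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (char.IsSurrogatePair(text, i))
            {
                // A well-formed pair encodes a character in #x10000-#x10FFFF,
                // which XML allows, so keep both halves.
                sb.Append(c).Append(text[i + 1]);
                i++;
            }
            else if (!char.IsSurrogate(c) && XmlConvert.IsXmlChar(c))
            {
                sb.Append(c);
            }
            // Anything else (control characters, lone surrogates) is skipped.
        }
        return sb.ToString();
    }
}
```

Unlike relaxing the reader's checks, this leaves you with a string that any XML parser can load.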

Record and replay human-readable protobuf messages using a file with protobuf-csharp-port

I'm using protobuf-csharp-port and I need the ability to record some protobuf messages for replay later. XML would be ideal for me, but I'm flexible as long as a human could go into the file, make changes, then replay the messages from the file.
Using this C# code:
MyRequest req =
MyRequest.CreateBuilder().SetStr("Lafayette").Build();
using (var stringWriter = new StringWriter())
{
var xmlWriterSettings = new XmlWriterSettings
{
ConformanceLevel = ConformanceLevel.Fragment
};
var xmlWriter = XmlWriter.Create(stringWriter, xmlWriterSettings);
ICodedOutputStream output = XmlFormatWriter.CreateInstance(xmlWriter);
req.WriteTo(output);
output.Flush();
string xml = stringWriter.ToString();
using (var streamWriter = new StreamWriter(@"Requests.txt"))
{
streamWriter.WriteLine(xml);
streamWriter.WriteLine(xml);
streamWriter.WriteLine(xml);
}
}
I produce the Requests.txt file containing:
<str>Lafayette</str>
<str>Lafayette</str>
<str>Lafayette</str>
However when I try to deserialize them back using:
var xmlReaderSettings = new XmlReaderSettings
{
ConformanceLevel = ConformanceLevel.Fragment
};
using (var xmlReader = XmlReader.Create(@"Requests.txt", xmlReaderSettings))
{
ICodedInputStream input = XmlFormatReader.CreateInstance(xmlReader);
MyRequest reqFromFile;
while(!input.IsAtEnd)
{
reqFromFile =
MyRequest.CreateBuilder().MergeFrom(input).Build();
}
}
Only one MyRequest gets deserialized and the other two are ignored. (After reqFromFile is built, input.IsAtEnd == true.)
So back to my question: is there some way to read multiple human-readable protobuf messages from a file?
Well you're currently creating a single text file with multiple root elements - it's not a valid XML file.
The simplest approach would probably be to create a single XML document with all those requests in, then load it (e.g. with LINQ to XML) and create an XmlReader positioned at each of the root element's children. Each of those XmlReader objects should be able to then deserialize to a single protobuf message.
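The LINQ to XML half of that approach might look like the sketch below. It deliberately leaves protobuf out: each per-message reader is only read back as text here, and whether XmlFormatReader accepts such a reader is not verified. The wrapper element names (requests, request) and the class name are assumptions made for illustration.

```csharp
using System.Collections.Generic;
using System.Xml;
using System.Xml.Linq;

public static class MessageSplitter
{
    // Loads one well-formed document and hands out a separate XmlReader
    // for each child of the root, one per recorded message.
    public static List<string> ReadEachMessage(string xml)
    {
        var results = new List<string>();
        XDocument doc = XDocument.Parse(xml);
        foreach (XElement message in doc.Root.Elements())
        {
            using (XmlReader reader = message.CreateReader())
            {
                reader.MoveToContent();
                // In the real program this reader would go to the
                // deserializer instead of being read back as text.
                results.Add(reader.ReadOuterXml());
            }
        }
        return results;
    }
}
```

A file recorded this way stays editable by hand, as long as the single root element is left in place.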
You could look at the Protocol buffer Editor
Alternatively, there is Google's own text format, which is a bit like JSON. You can convert between the two using the protoc command (look at the encode/decode options; you also need the proto definition).

Convert utf-8 XML document to utf-16 for inserting into SQL

I have an XML document that has been created using utf-8 encoding. I want to store that document in a sql 2008 xml column but I understand I need to convert it to utf-16 in order to do that.
I've tried using XDocument to do this, but I'm not getting a valid XML result after the conversion. Here is the code I've tried for the conversion (Utf8StringWriter is a small class that inherits from StringWriter and overrides Encoding):
XDocument xDoc = XDocument.Parse(utf8Xml);
StringWriter writer = new StringWriter();
XmlWriter xml = XmlWriter.Create(writer, new XmlWriterSettings()
{ Encoding = writer.Encoding, Indent = true });
xDoc.WriteTo(xml);
string utf16Xml = writer.ToString();
The data in the utf16Xml is invalid and when trying to insert into the database I get the error:
{"XML parsing: line 1, character 38, unable to switch the encoding"}
However the initial utf8Xml data is definitely valid and contains all the info I need.
UPDATE:
The initial XML is obtained by using XMLSerializer (with an Utf8StringWriter class) to create the xml string from an existing object model (engine). The code for this is:
public static void Serialise<T>(T engine, ref StringWriter writer)
{
XmlWriter xml = XmlWriter.Create(writer, new XmlWriterSettings() { Encoding = writer.Encoding });
XmlSerializer xs = new XmlSerializer(engine.GetType());
xs.Serialize(xml, engine);
}
I have to leave this like this as that code is out of my control to change.
Before I even send the utf16Xml string to the failing database call I can view it via the Visual Studio debugger and I notice that the entire string is not present and instead I get a string literal was not closed error on the XML viewer.
The error is on the first line, XDocument xDoc = XDocument.Parse(utf8Xml);. Most likely you converted a UTF-8 stream into a string (utf8Xml), but the encoding specified in the string is still utf-8, so the XML reader fails. If that is true, then load the XML directly from the stream using Load instead of converting it to a string first.
Set the encoding of the document to UTF-16 after you have parsed it from utf8Xml:
XDocument xDoc = XDocument.Parse(utf8Xml);
xDoc.Declaration.Encoding = "utf-16";
StringWriter writer = new StringWriter();
XmlWriter xml = XmlWriter.Create(writer, new XmlWriterSettings()
{ Encoding = writer.Encoding, Indent = true });
xDoc.WriteTo(xml);
string utf16Xml = writer.ToString();
Here's what I had to do to make it work. This just converts the XML to utf-16
string getUtf16Xml(System.Xml.XmlDocument xmlDoc)
{
System.Xml.Linq.XDocument xDoc = System.Xml.Linq.XDocument.Parse(xmlDoc.OuterXml);
xDoc.Declaration.Encoding = "utf-16";
return xDoc.ToString();
}
Then I can save the results to the DB.
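For comparison, saving an XDocument through a plain StringWriter makes the declaration follow the writer's encoding, which for .NET strings is UTF-16. The sketch below assumes the input string parses cleanly; the class and method names are hypothetical, and it says nothing about how the result is then passed to SQL Server.

```csharp
using System.IO;
using System.Xml.Linq;

public static class Utf16Converter
{
    // Re-serializes the XML so that its declaration matches the
    // UTF-16 string it is stored in.
    public static string ToUtf16XmlString(string utf8Xml)
    {
        XDocument doc = XDocument.Parse(utf8Xml);
        using (var writer = new StringWriter())
        {
            // StringWriter reports UTF-16 as its encoding, and the XML
            // writer puts that into the declaration it emits.
            doc.Save(writer);
            return writer.ToString();
        }
    }
}
```

This keeps the declaration and the actual string encoding in agreement, which is the mismatch the "unable to switch the encoding" error points at.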

How to get Xml as string from XDocument?

I am new to LINQ to XML. After you have built XDocument, how do you get the OuterXml of it like you did with XmlDocument?
You only need to use the overridden ToString() method of the object:
XDocument xmlDoc ...
string xml = xmlDoc.ToString();
This works with all XObjects, like XElement, etc.
I don't know when this changed, but today (July 2017) when trying the answers out, I got
"System.Xml.XmlDocument"
Instead of ToString(), you can use the originally intended way accessing the XmlDocument content: writing the xml doc to a stream.
XmlDocument xml = ...;
string result;
using (StringWriter writer = new StringWriter())
{
xml.Save(writer);
result = writer.ToString();
}
Several responses give a slightly incorrect answer.
XDocument.ToString() omits the XML declaration (and, according to @Alex Gordon, may return invalid XML if it contains encoded unusual characters like &).
Saving XDocument to StringWriter will cause .NET to emit encoding="utf-16", which you most likely don't want (if you save XML as a string, it's probably because you want to later save it as a file, and de facto standard for saving files is UTF-8 - .NET saves text files as UTF-8 unless specified otherwise).
@Wolfgang Grinfeld's answer is heading in the right direction, but it's unnecessarily complex.
Use the following:
var memory = new MemoryStream();
xDocument.Save(memory);
string xmlText = Encoding.UTF8.GetString(memory.ToArray());
This will return XML text with UTF-8 declaration.
Doing XDocument.ToString() may not get you the full XML.
In order to get the XML declaration at the start of the XML document as a string, use the XDocument.Save() method:
var ms = new MemoryStream();
using (var xw = XmlWriter.Create(new StreamWriter(ms, Encoding.GetEncoding("ISO-8859-1"))))
new XDocument(new XElement("Root", new XElement("Leaf", "data"))).Save(xw);
var myXml = Encoding.GetEncoding("ISO-8859-1").GetString(ms.ToArray());
Use ToString() to convert XDocument into a string:
string result = string.Empty;
// XCData writes a real CDATA section; a literal "<![CDATA[" string would be escaped as text.
XElement root = new XElement("xml",
    new XElement("MsgType", new XCData("text")),
    new XElement("Content", new XCData("Hi, this is Wilson Wu Testing for you! You can ask any question but no answer can be replied....")),
    new XElement("FuncFlag", 0)
);
result = root.ToString();
While @wolfgang-grinfeld's answer is technically correct (it also produces the XML declaration, as opposed to just using the .ToString() method), the code generates a UTF-8 byte order mark (BOM), which for some reason the XDocument.Parse(string) method cannot process; it throws a "Data at the root level is invalid. Line 1, position 1." error.
So here is another solution without the BOM:
var utf8Encoding =
new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);
using (var memory = new MemoryStream())
using (var writer = XmlWriter.Create(memory, new XmlWriterSettings
{
OmitXmlDeclaration = false,
Encoding = utf8Encoding
}))
{
CompanyDataXml.Save(writer);
writer.Flush();
return utf8Encoding.GetString(memory.ToArray());
}
I found this example in the Microsoft .NET 6 documentation for the XDocument.Save method. I think it answers the original question (what is the XDocument equivalent of XmlDocument.OuterXml), and also addresses the concerns that others have pointed out already. By using XmlWriterSettings you can predictably control the string output.
https://learn.microsoft.com/en-us/dotnet/api/system.xml.linq.xdocument.save
StringBuilder sb = new StringBuilder();
XmlWriterSettings xws = new XmlWriterSettings();
xws.OmitXmlDeclaration = true;
xws.Indent = true;
using (XmlWriter xw = XmlWriter.Create(sb, xws)) {
XDocument doc = new XDocument(
new XElement("Child",
new XElement("GrandChild", "some content")
)
);
doc.Save(xw);
}
Console.WriteLine(sb.ToString());
Looking at these answers, I see a lot of unnecessary complexity and inefficiency in pursuit of generating the XML declaration automatically. But since the declaration is so simple, there isn't much value in generating it. Just KISS (keep it simple, stupid):
// Extension method
public static string ToStringWithDeclaration(this XDocument doc, string declaration = null)
{
declaration ??= "<?xml version=\"1.0\" encoding=\"utf-8\"?>\r\n";
return declaration + doc.ToString();
}
// Usage
string xmlString = doc.ToStringWithDeclaration();
// Or
string xmlString = doc.ToStringWithDeclaration("...");
Using XmlWriter instead of ToString() can give you more control over how the output is formatted (such as if you want indentation), and it can write to other targets besides string.
The reason to target a memory stream is performance. It lets you skip the step of storing the XML in a string (since you know the data must end up in a different encoding eventually, whereas string is always UTF-16 in C#). For instance, for an HTTP request:
// Extension method
public static ByteArrayContent ToByteArrayContent(
this XDocument doc, XmlWriterSettings xmlWriterSettings = null)
{
xmlWriterSettings ??= new XmlWriterSettings();
using (var stream = new MemoryStream())
{
using (var writer = XmlWriter.Create(stream, xmlWriterSettings))
{
doc.Save(writer);
}
var content = new ByteArrayContent(stream.GetBuffer(), 0, (int)stream.Length);
content.Headers.ContentType = new MediaTypeHeaderValue("text/xml");
return content;
}
}
// Usage (XDocument -> UTF-8 bytes)
var content = doc.ToByteArrayContent();
var response = await httpClient.PostAsync("/someurl", content);
// Alternative (XDocument -> string -> UTF-8 bytes)
var content = new StringContent(doc.ToStringWithDeclaration(), Encoding.UTF8, "text/xml");
var response = await httpClient.PostAsync("/someurl", content);
