Unicode escape characters not being read by XmlReader - c#

I've got an XML document that I'm importing into an XmlReader, and it has some Unicode formatting I need to preserve. I'm preserving the whitespace, but it's dropping the encoded &#x2028; (U+2028, LINE SEPARATOR), which I assume should be expressed as a line break.
Here's my code:
var settings = new XmlReaderSettings
{
    ProhibitDtd = false,
    XmlResolver = null,
    IgnoreWhitespace = false
};
var reader = XmlReader.Create(new StreamReader(fu.PostedFile.InputStream), settings);
var document = new XmlDocument { PreserveWhitespace = true };
document.Load(reader);
return document;
XML example:
<td valign="top" align="center">Camels and camel&#x2028;resting place</td>
How do I get to those characters so I can render br tags?

Your question is unclear: do you expect the XmlReader to translate the &#x2028; into an HTML <br> tag? That isn't going to happen.
Or are you examining the actual character content of the <td> element (within the code, not as printed/displayed) and seeing "camel resting place"? If yes, please show the code that you're using to verify this, because it would be a pretty major bug.
Or something else?

After loading the document through the reader, I was able to find and replace that character:
Regex.Replace(s, "\u2028", "<br/>");
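The replacement above can be sketched end to end as follows. This is only a minimal illustration: the class and method names are hypothetical, and it assumes the text is read through InnerText after the parser has resolved the character reference to U+2028.

```csharp
using System;
using System.Xml;

public static class LineSeparatorDemo
{
    // Hypothetical helper: loads an XML fragment and returns the text of its
    // root element with every U+2028 (LINE SEPARATOR) replaced by <br/>.
    public static string ToHtmlText(string xml)
    {
        var document = new XmlDocument { PreserveWhitespace = true };
        document.LoadXml(xml);

        // The parser turns the reference &#x2028; into the single character
        // U+2028, so it is present in the text even though most viewers
        // do not display it visibly.
        string text = document.DocumentElement.InnerText;

        // A plain string replacement is enough here; no regex is required.
        return text.Replace("\u2028", "<br/>");
    }
}
```

Calling LineSeparatorDemo.ToHtmlText("<td>Camels and camel&#x2028;resting place</td>") returns "Camels and camel<br/>resting place". If the result is rendered as HTML, the rest of the text would normally need HTML encoding before the <br/> tags are inserted.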

Related

IgnoreWhiteSpace not ignoring whitespace at beginning of xml string

Question
Should whitespace be ignored at the beginning of my multi-line string literal xml?
Code
string XML = @"
<?xml version=""1.0"" encoding=""utf-8"" ?>";
using (StringReader stringReader = new StringReader(XML))
using (XmlReader xmlReader = XmlReader.Create(stringReader,
    new XmlReaderSettings() { IgnoreWhitespace = true }))
{
    xmlReader.MoveToContent();
    // further implementation withheld
}
Notice in the above code that there is whitespace before the XML declaration. It doesn't seem to be ignored, despite my setting the IgnoreWhitespace property. Where am I going wrong?
Note: I have the same behaviour when the XML string does not have a line break, and just a whitespace, as below. I know this will run if I remove the whitespace, my question is as to why the property doesn't take care of this?
string XML = @" <?xml version=""1.0"" encoding=""utf-8"" ?>";
The documentation says that the IgnoreWhitespace property "Gets or sets a value indicating whether to ignore insignificant white space." While that leading whitespace (and line break) might seem insignificant, the authors of XmlReader apparently didn't think so. Just trim the XML before use, and you'll be fine.
As stated in comments and for clarity, change your code to:
string XML = @"<?xml version=""1.0"" encoding=""utf-8"" ?>";
using (StringReader stringReader = new StringReader(XML.Trim()))
using (XmlReader xmlReader = XmlReader.Create(stringReader,
    new XmlReaderSettings() { IgnoreWhitespace = true }))
{
    xmlReader.MoveToContent();
    // further implementation withheld
}
According to Microsoft's documentation regarding the XML declaration:
The XML declaration typically appears as the first line in an XML
document. The XML declaration is not required, however, if used it
must be the first line in the document and no other content or white
space can precede it.
The parse should fail for your code because white space precedes the XML declaration. Removing either the white space OR the xml declaration will result in a successful parse.
In other words it would be a bug if XmlReaderSettings were at odds with the documentation for XML Declaration - it is defined behavior.
Here's some code demonstrating the above rules.
using System;
using System.Web;
using System.Xml;
using System.Xml.Linq;
public class Program
{
    public static void Main()
    {
        // The XML declaration is not required, however, if used it must
        // be the first line in the document and no other content or
        // white space can precede it.
        // Here, no problem because this does not have an XML declaration.
        string xml = @"
<xml></xml>";
        XDocument doc = XDocument.Parse(xml);
        Console.WriteLine(doc.Document.Declaration);
        Console.WriteLine(doc.Document);
        //
        // Problem here because this does have an XML declaration.
        //
        xml = @"
<?xml version=""1.0"" encoding=""utf-8"" ?><xml></xml>";
        try
        {
            doc = XDocument.Parse(xml);
            Console.WriteLine(doc.Document.Declaration);
            Console.WriteLine(doc.Document);
        }
        catch (Exception e)
        {
            Console.WriteLine(e.Message);
        }
    }
}

How can I add specific escape characters to xmlserializer?

I have a method that serializes an object to xml and returns the string:
public static string SerializeType<T>(T item)
{
var serializer = new XmlSerializer(typeof(T));
var builder = new StringBuilder();
var settings = new XmlWriterSettings();
settings.OmitXmlDeclaration = true;
using (var stringWriter = XmlWriter.Create(builder, settings))
{
serializer.Serialize(stringWriter, item);
return builder.ToString();
}
}
However, it is not escaping all the reserved characters in the strings of the objects I pass in. Microsoft lists the reserved characters as <, >, & and %, but when I pass in an item with an "a<ab>bc&cd%d" string field, it spits out "a&lt;ab&gt;bc&amp;cd%d". The % is not being escaped. How can I add the correct escape sequence for percent? The % causes an error when I send it to a client's app. The escaping listed on that page fixes the problem.
% isn't really a reserved character in XML. The documentation you've referred to is for SQL server, and there's a small note under the table:
The Notification Services XML vocabulary reserves the percent sign (%) for denoting parameters.
But you shouldn't expect XmlSerializer (or any other general-purpose XML library) to escape % for you. Unless you're using "Notification Services XML" I wouldn't expect this to be a problem.

How to get text from html nodes and solve character encoding issue?

I'm trying to get the inner text of this page, http://www.hurriyet.com.tr/yazarlar/22933964.asp, with HtmlAgilityPack.
The HTML structure is
<div class="detailText">
<span class="yzrArticleDate">30 Mart 2014</span>
<h1 class="yazarArticleTitle">31 Mart sabahı için acil ihtiyaç listesi</h1>
<p></p><p><p >Akıl.<br />Sağduyu.<br />Barış.<br />
Özgürlük.<br />Kardeşlik.<br />Vicdan.<br />Huzur.............
and my current code
string htmlContent = getsource(s);
HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
document.LoadHtml(htmlContent);
var noa = document.DocumentNode.SelectSingleNode("*//div[@class='detailText']").InnerText;
The problem is that the result includes the date and the heading, i.e. "30 Mart 2014" and "31 Mart sabahı için acil ihtiyaç listesi".
I only want the part that begins with
<p></p><p><p >Akıl.<br />
I tried different variations:
var noa = document.DocumentNode.SelectSingleNode("*//div[@class='detailText']").InnerHtml;
var noa = document.DocumentNode.SelectSingleNode("*//div[@class='detailText']").NextSibling.NextSibling.InnerText;
var noa = document.DocumentNode.SelectSingleNode("*//div[@class='detailText']").LastSibling.InnerText;
My second question: once I manage to extract this text, I expect to run into a character encoding problem. How can I fix that?
The easiest solution would be to remove the nodes you don't want and then get InnerHtml/InnerText, as covered in "remove html node from htmldocument :HTMLAgilityPack".
var noa = document.DocumentNode.SelectSingleNode("*//div[@class='detailText']");
noa.RemoveChild(noa.SelectSingleNode("span"));
// remove the rest too...
var result = noa.InnerText;
There should be no encoding problem unless the site reports an invalid encoding, because C# strings are Unicode (UTF-16).

XmlWriter inserting spaces when xml:space=preserve

Given this code (C#, .NET 3.5 SP1):
var doc = new XmlDocument();
doc.LoadXml("<?xml version=\"1.0\"?><root>"
+ "<value xml:space=\"preserve\">"
+ "<item>content</item>"
+ "<item>content</item>"
+ "</value></root>");
var text = new StringWriter();
var settings = new XmlWriterSettings() { Indent = true, CloseOutput = true };
using (var writer = XmlWriter.Create(text, settings))
{
doc.DocumentElement.WriteTo(writer);
}
var xml = text.GetStringBuilder().ToString();
Assert.AreEqual("<?xml version=\"1.0\" encoding=\"utf-16\"?>\r\n<root>\r\n"
+ " <value xml:space=\"preserve\"><item>content</item>"
+ "<item>content</item></value>\r\n</root>", xml);
The assertion fails because the XmlWriter is inserting a newline and indent around the <item> elements, which would seem to contradict the xml:space="preserve" attribute.
I am trying to take input with no whitespace (or only significant whitespace, and already loaded into an XmlDocument) and pretty-print it without adding any whitespace inside elements marked to preserve whitespace (for obvious reasons).
Is this a bug or am I doing something wrong? Is there a better way to achieve what I'm trying to do?
Edit: I should probably add that I do have to use an XmlWriter with Indent=true on the output side. In the "real" code, this is being passed in from outside of my code.
Ok, I've found a workaround.
It turns out that XmlWriter does the correct thing if there actually is any whitespace within the xml:space="preserve" block -- it's only when there isn't any that it screws up and adds some. And conveniently, this also works if there are some whitespace nodes, even if they're empty. So the trick that I've come up with is to decorate the document with extra 0-length whitespace in the appropriate places before trying to write it out. The result is exactly what I want: pretty printing everywhere except where whitespace is significant.
The workaround is to change the inner block to:
PreserveWhitespace(doc.DocumentElement);
doc.DocumentElement.WriteTo(writer);
...
private static void PreserveWhitespace(XmlElement root)
{
    var nsmgr = new XmlNamespaceManager(root.OwnerDocument.NameTable);
    foreach (var element in root.SelectNodes("//*[@xml:space='preserve']", nsmgr)
        .OfType<XmlElement>())
    {
        if (element.HasChildNodes && !(element.FirstChild is XmlSignificantWhitespace))
        {
            var whitespace = element.OwnerDocument.CreateSignificantWhitespace("");
            element.InsertBefore(whitespace, element.FirstChild);
        }
    }
}
I'm still thinking that this behaviour of XmlWriter is a bug, though.

Using XDocument to write raw XML

I'm trying to create a spreadsheet in XML Spreadsheet 2003 format (so Excel can read it). I'm writing out the document using the XDocument class, and I need to get a newline in the body of one of the <Cell> tags. Excel, when it reads and writes, requires the files to have the literal string &#10; embedded in the string to correctly show the newline in the spreadsheet. It also writes it out as such.
The problem is that XDocument is writing CR-LF (\r\n) when I have newlines in my data, and it automatically escapes ampersands for me when I try to do a .Replace() on the input string, so I end up with &amp;#10; in my file, which Excel just happily writes out as a string literal.
Is there any way to make XDocument write out the literal &#10; as part of the XML stream? I know I can do it by deriving from XmlTextWriter, or literally just writing out the file with a TextWriter, but I'd prefer not to if possible.
I wonder if it might be better to use XmlWriter directly, and WriteRaw?
A quick check shows that XmlDocument makes a slightly better job of it, but xml and whitespace gets tricky very quickly...
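The WriteRaw suggestion can be sketched roughly as below. This is an illustration only: the class and method names are hypothetical, the element name Data is borrowed from the spreadsheet format mentioned in the question, and it bypasses XDocument entirely rather than changing how XDocument writes.

```csharp
using System.Text;
using System.Xml;

public static class RawEntityDemo
{
    // Hypothetical helper: writes a <Data> element whose two text parts are
    // separated by the literal character reference &#10;.
    public static string WriteCell(string firstLine, string secondLine)
    {
        var builder = new StringBuilder();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };

        using (var writer = XmlWriter.Create(builder, settings))
        {
            writer.WriteStartElement("Data");
            writer.WriteString(firstLine);   // escaped as usual (&, <, >)
            writer.WriteRaw("&#10;");        // emitted verbatim, no escaping
            writer.WriteString(secondLine);
            writer.WriteEndElement();
        }

        return builder.ToString();
    }
}
```

RawEntityDemo.WriteCell("Hello", "World") yields <Data>Hello&#10;World</Data>. Because WriteRaw skips all escaping and validation, anything passed to it has to be well-formed already.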
I battled with this problem for a couple of days and finally came up with this solution. I used the XmlDocument.Save(Stream) method, then got the formatted XML string from the stream. Then I replaced the &amp;#10; occurrences with &#10; and used a TextWriter to write the string to a file.
string xml = "<?xml version=\"1.0\"?><?mso-application progid='Excel.Sheet'?><Workbook xmlns=\"urn:schemas-microsoft-com:office:spreadsheet\" xmlns:o=\"urn:schemas-microsoft-com:office:office\" xmlns:x=\"urn:schemas-microsoft-com:office:excel\" xmlns:ss=\"urn:schemas-microsoft-com:office:spreadsheet\" xmlns:html=\"http://www.w3.org/TR/REC-html40\">";
xml += "<Styles><Style ss:ID=\"s1\"><Alignment ss:Vertical=\"Center\" ss:WrapText=\"1\"/></Style></Styles>";
xml += "<Worksheet ss:Name=\"Default\"><Table><Column ss:Index=\"1\" ss:AutoFitWidth=\"0\" ss:Width=\"75\" /><Row><Cell ss:StyleID=\"s1\"><Data ss:Type=\"String\">Hello&amp;#10;&amp;#10;World</Data></Cell></Row></Table></Worksheet></Workbook>";
System.Xml.XmlDocument doc = new System.Xml.XmlDocument();
doc.LoadXml(xml); //load the xml string
System.IO.MemoryStream stream = new System.IO.MemoryStream();
doc.Save(stream); //save the xml as a formatted string
stream.Position = 0; //reset the stream position since it will be at the end from the Save method
System.IO.StreamReader reader = new System.IO.StreamReader(stream);
string formattedXML = reader.ReadToEnd(); //fetch the formatted XML into a string
formattedXML = formattedXML.Replace("&amp;#10;", "&#10;"); //Replace the unhelpful &amp;#10;'s with the wanted endline entity
System.IO.TextWriter writer = new System.IO.StreamWriter("C:\\Temp\\test1.xls");
writer.Write(formattedXML); //write the XML to a file
writer.Close();
