Read special characters back from XmlDocument in c# - c#

Say I have an XML document with an escaped ampersand (&amp;). How do I read it back so that the result gives me the unescaped text?
Running the following gives me "&amp;" as the result. How do I get back "&"?
void Main()
{
var xml = @"
<a>
&amp;
</a>
";
var doc = new XmlDocument();
doc.LoadXml(xml);
var ele = (XmlElement)doc.FirstChild;
Console.WriteLine (ele.InnerXml);
}

Use ele.InnerText instead of ele.InnerXml
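As a minimal sketch of the difference (the class and method names are illustrative and not part of the question), InnerXml returns the markup with entities still escaped, while InnerText returns the decoded text:

```csharp
using System.Xml;

public static class InnerTextDemo
{
    // Returns the root element's markup as written, entities still escaped.
    public static string ReadRaw(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc.DocumentElement.InnerXml;
    }

    // Returns the root element's text with entities decoded.
    public static string ReadText(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc.DocumentElement.InnerText;
    }
}
```

For the input <a>&amp;</a>, ReadRaw gives back the escaped form and ReadText gives back a single ampersand.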

You can use CDATA to hold your data.
Characters like "<" and "&" are illegal in XML element content. The parser interprets "<" as the start of a new element and "&" as the start of a character entity.

use
HttpServerUtility.HtmlDecode (ele.InnerXml);

Console.WriteLine (HttpUtility.UrlDecode(String.Format("{0}",ele.InnerXml)));

Try using this static method to decode escaped characters:
HttpServerUtility.HtmlDecode Method (String)
See example here:
http://msdn.microsoft.com/ru-ru/library/hwzhtkke.aspx

The following characters are illegal in XML elements:
Illegal   Escaped form
----------------------
"         &quot;
'         &apos;
<         &lt;
>         &gt;
&         &amp;
To get the unescaped value you can use:
public string UnescapeXMLValue(string xmlValue)
{
    if (string.IsNullOrEmpty(xmlValue)) return xmlValue;
    string temp = xmlValue;
    // "&amp;" is replaced last so that "&amp;lt;" becomes "&lt;" rather than "<".
    temp = temp.Replace("&apos;", "'").Replace("&quot;", "\"").Replace("&gt;", ">").Replace("&lt;", "<").Replace("&amp;", "&");
    return temp;
}
To get the escaped value you can use:
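public string EscapeXMLValue(string value)
{
    if (string.IsNullOrEmpty(value)) return value;
    string temp = value;
    // "&" is replaced first; otherwise the ampersands introduced by the other replacements would be escaped again.
    temp = temp.Replace("&", "&amp;").Replace("'", "&apos;").Replace("\"", "&quot;").Replace(">", "&gt;").Replace("<", "&lt;");
    return temp;
}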
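If hand-written Replace chains feel fragile, the framework's System.Security.SecurityElement.Escape covers the same five characters. The sketch below is only one possible alternative, and the wrapper class name is made up for illustration:

```csharp
using System.Security;

public static class XmlEscapeDemo
{
    // Escapes <, >, ", ' and & in a single call.
    // SecurityElement.Escape returns null when given null.
    public static string Escape(string value)
    {
        return SecurityElement.Escape(value);
    }
}
```

Because the escaping happens in one pass, there is no ordering problem with the ampersand.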

Related

Replacing anchor/link in text

I'm having issues doing a find/replace type of action in my function. I'm extracting the <a href="link">anchor from an article and replacing it with this format: [link anchor]. The link and anchor will be dynamic, so I can't hard-code the values. What I have so far is:
public static string GetAndFixAnchor(string articleBody, string articleWikiCheck) {
string theString = string.Empty;
switch (articleWikiCheck) {
case "id|wpTextbox1":
StringBuilder newHtml = new StringBuilder(articleBody);
Regex r = new Regex(@"\<a href=\""([^\""]+)\"">([^<]+)");
string final = string.Empty;
foreach (var match in r.Matches(theString).Cast<Match>().OrderByDescending(m => m.Index))
{
string text = match.Groups[2].Value;
string newHref = "[" + match.Groups[1].Index + " " + match.Groups[1].Index + "]";
newHtml.Remove(match.Groups[1].Index, match.Groups[1].Length);
newHtml.Insert(match.Groups[1].Index, newHref);
}
theString = newHtml.ToString();
break;
default:
theString = articleBody;
break;
}
Helpers.ReturnMessage(theString);
return theString;
}
Currently, it just returns the article unchanged, with the traditional anchor format: <a href="link">anchor
Can anyone see what I have done wrong?
Regards
If your input is HTML, you should consider using a corresponding parser, HtmlAgilityPack being really helpful.
As for the current code, it looks too verbose. You may use a single Regex.Replace to perform the search and replace in one pass:
public static string GetAndFixAnchor(string articleBody, string articleWikiCheck) {
if (articleWikiCheck == "id|wpTextbox1")
{
return Regex.Replace(articleBody, @"<a\s+href=""([^""]+)"">([^<]+)", "[$1 $2]");
}
else
{
// Helpers.ReturnMessage(articleBody); // Uncomment if it is necessary
return articleBody;
}
}
See the regex demo.
The <a\s+href="([^"]+)">([^<]+) regex matches <a, 1 or more whitespaces, href=", then captures into Group 1 any one or more chars other than ", then matches "> and then captures into Group 2 any one or more chars other than <.
The [$1 $2] replacement replaces the matched text with [, Group 1 contents, space, Group 2 contents and a ].
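A small self-contained sketch of the same idea follows. It assumes each anchor has href as its only attribute and plain-text content, and, unlike the pattern above, it also consumes the closing </a> so that no stray tag is left behind. The class and method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class AnchorRewriter
{
    // Group 1 captures the href value, group 2 the anchor text.
    private static readonly Regex Anchor =
        new Regex(@"<a\s+href=""([^""]+)"">([^<]+)</a>", RegexOptions.IgnoreCase);

    // Rewrites every matching anchor as "[link text]".
    public static string ToBracketLinks(string html)
    {
        return Anchor.Replace(html, "[$1 $2]");
    }
}
```

Anchors with extra attributes or nested markup are left untouched by this pattern, which is one reason an HTML parser is the safer route for arbitrary input.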
Updated (Corrected regex to support whitespaces and new lines)
You can try this expression
Regex r = new Regex(@"<[\s\n]*a[\s\n]*(([^\s]+\s*[ ]*=*[ ]*[\s|\n*]*('|"").*\3)[\s\n]*)*href[ ]*=[ ]*('|"")(?<link>.*)\4[.\n]*>(?<anchor>[\s\S]*?)[\s\n]*<\/[\s\n]*a>");
It will match your anchors even if they are split across multiple lines. It is so long because it allows whitespace between the tags and their values, and because C# regexes do not support subroutines, so the [\s\n]* part has to be repeated multiple times.
You can see a working sample at dotnetfiddle
You can use it in your example like this.
public static string GetAndFixAnchor(string articleBody, string articleWikiCheck) {
if (articleWikiCheck == "id|wpTextbox1")
{
return Regex.Replace(articleBody,
@"<[\s\n]*a[\s\n]*(([^\s]+\s*[ ]*=*[ ]*[\s|\n*]*('|"").*\3)[\s\n]*)*href[ ]*=[ ]*('|"")(?<link>.*)\4[.\n]*>(?<anchor>[\s\S]*?)[\s\n]*<\/[\s\n]*a>",
"[${link} ${anchor}]");
}
else
{
return articleBody;
}
}

XmlSerializer escapes an added escape character

I'm using the XmlSerializer to output a class to a .xml file. For the most part, this is working as expected and intended. However, as a requirement, certain characters need to be removed from the values of the data and replaced with their proper escape characters.
In the elements I need to replace values in, I'm using the Replace() method and returning the updated string. The code below shows this string replacement; the lines commented out are because the XmlSerializer already escapes those particular characters.
I have a requirement from a third-party to escape &, <, >, ', and " characters when they appear within the values of the XML elements. Currently the characters &, <, and > are being escaped appropriately through the XmlSerializer.
The error received when these characters are present is:
Our system has detected a potential threat in the request message attachment.
However, when I serialize the XML document after performing the string replacement, the XmlSerializer sees the & character in &apos; and turns it into &amp;apos;. I think this is correct behavior for the XmlSerializer object. However, I would like the serializer to either (a) ignore the escape sequences, or (b) escape the other characters that need escaping.
Can anyone shed some light on, specifically, how to accomplish either of these?
String Replacement Method
public static string CheckValueOfProperty(string str)
{
string trimmedString = str.Trim();
if (string.IsNullOrEmpty(trimmedString))
return null;
else
{
// Commented out because the Serializer already transforms a '&' character into the appropriate escape character.
//trimmedString = trimmedString.Replace("&", "&amp;");
//trimmedString = trimmedString.Replace("<", "&lt;");
//trimmedString = trimmedString.Replace(">", "&gt;");
trimmedString = trimmedString.Replace("'", "&apos;");
trimmedString = trimmedString.Replace("\"", "&quot;");
return trimmedString;
}
}
XmlSerializer Code
public static void SerializeAndOutput(object obj, string outputFilePath, XmlSerializerNamespaces ns = null)
{
XmlSerializer x = new XmlSerializer(obj.GetType());
// If the Output File already exists, delete it.
if (File.Exists(outputFilePath))
{
File.Delete(outputFilePath);
}
// Then, Create the Output File and Serialize the parameterized object as Xml to the Output File
using (TextWriter tw = File.CreateText(outputFilePath))
{
if (ns == null)
{
x.Serialize(tw, obj);
}
else { x.Serialize(tw, obj, ns); }
}
// =====================================================================
// The code below here is no longer needed; it was used to force "utf-8" to
// "UTF-8" to ensure the result was what was expected.
// =====================================================================
// Create a new XmlDocument object, and load the contents of the OutputFile into the XmlDocument
// XmlDocument xdoc = new XmlDocument() { PreserveWhitespace = true };
// xdoc.Load(outputFilePath);
// Set the Encoding property of each XmlDeclaration in the document to "UTF-8";
// xdoc.ChildNodes.OfType<XmlDeclaration>().ToList().ForEach(d => d.Encoding = "UTF-8");
// Save the XmlDocument to the Output File Path.
// xdoc.Save(outputFilePath);
}
The single and double quote characters do not need to be escaped when used inside node content in XML. They only need to be escaped when used in the value of a node attribute. That's why the XmlSerializer does not escape them, and you do not need to escape them either.
See this question and answer for reference.
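To see that difference in isolation, here is a minimal sketch. It uses LINQ to XML (XElement) rather than XmlSerializer, which is my own substitution for brevity, and the helper names are illustrative:

```csharp
using System.Xml.Linq;

public static class QuoteEscapingDemo
{
    // Places the value in element content: quotes are written as-is,
    // only & and < need escaping there.
    public static string AsContent(string value)
    {
        return new XElement("a", value).ToString();
    }

    // Places the value in an attribute: the double quote that delimits
    // the attribute has to be escaped.
    public static string AsAttribute(string value)
    {
        return new XElement("a", new XAttribute("v", value)).ToString();
    }
}
```

With the same input string, the content form keeps its quotes untouched while the attribute form writes the double quotes as &quot;.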
BTW: The way you set the encoding to UTF-8 afterwards is awkward as well. You can specify the encoding on the StreamWriter, and the XmlSerializer will then automatically use that encoding and also state it in the XML declaration.
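A sketch of that approach is shown below. The class name is hypothetical, and using UTF8Encoding(false) to suppress the byte order mark is my own choice rather than something the answer prescribes.

```csharp
using System.IO;
using System.Text;
using System.Xml.Serialization;

public static class Utf8XmlOutput
{
    // Serializes obj to path; the writer's encoding drives both the bytes
    // written and the encoding named in the XML declaration.
    public static void Serialize(object obj, string path)
    {
        var serializer = new XmlSerializer(obj.GetType());
        using (var writer = new StreamWriter(path, false, new UTF8Encoding(false)))
        {
            serializer.Serialize(writer, obj);
        }
    }
}
```

Passing false for the append argument overwrites an existing file, so the separate File.Exists and File.Delete step is not needed with this variant.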
Here's the solution I came up with. I have only tested it with a sample XML file and not the actual XML file I'm creating, so performance may take a hit; however, this seems to be working.
I'm reading the XML file line-by-line as a string, and replacing any of the defined "special" characters found in the string with their appropriate escape characters. It should process in the order of the specialCharacterList Dictionary<string, string> variable, which means the & character should process first. When processing <, > and " characters, it will only look at the value of the XML element.
using System;
using System.Collections.Generic;
using System.IO;
namespace testSerializer
{
class Program
{
private static string filePath = AppDomain.CurrentDomain.BaseDirectory + "testFile.xml";
private static string tempFile = AppDomain.CurrentDomain.BaseDirectory + "tempFile.xml";
private static Dictionary<string, string> specialCharacterList = new Dictionary<string, string>()
{
{"&","&amp;"}, {"<","&lt;"}, {">","&gt;"}, {"'","&apos;"}, {"\"","&quot;"}
};
static void Main(string[] args)
{
ReplaceSpecialCharacters();
}
private static void ReplaceSpecialCharacters()
{
string[] allLines = File.ReadAllLines(filePath);
using (TextWriter tw = File.CreateText(tempFile))
{
foreach (string strLine in allLines)
{
string newLineString = "";
string originalString = strLine;
foreach (var item in specialCharacterList)
{
// Since these characters are all valid characters to be present in the XML,
// We need to look specifically within the VALUE of the XML Element.
if (item.Key == "\"" || item.Key == "<" || item.Key == ">")
{
// Find the ending character of the beginning XML tag.
int firstIndexOfCloseBracket = originalString.IndexOf('>');
// Find the beginning character of the ending XML tag.
int lastIndexOfOpenBracket = originalString.LastIndexOf('<');
if (lastIndexOfOpenBracket > firstIndexOfCloseBracket)
{
// Determine the length of the string between the XML tags.
int lengthOfStringBetweenBrackets = lastIndexOfOpenBracket - firstIndexOfCloseBracket;
// Retrieve the string that is between the element tags.
string valueOfElement = originalString.Substring(firstIndexOfCloseBracket + 1, lengthOfStringBetweenBrackets - 1);
newLineString = originalString.Substring(0, firstIndexOfCloseBracket + 1) + valueOfElement.Replace(item.Key, item.Value) + originalString.Substring(lastIndexOfOpenBracket);
}
}
// For the ampersand (&) and apostrophe (') characters, simply replace any found with the escape.
else
{
newLineString = originalString.Replace(item.Key, item.Value);
}
// Set the "original" string to the new version.
originalString = newLineString;
}
tw.WriteLine(newLineString);
}
}
}
}
}

Replace character references for invalid XML characters

I am projecting some data as XML from SQL Server using ADO.NET. Some of my data contains characters that are invalid in XML, such as CHAR(7) (known as BEL).
SELECT 'This is BEL: ' + CHAR(7) AS A FOR XML RAW
SQL Server encodes such invalid characters as numeric references:
<row A="This is BEL: &#x07;" />
However, even the encoded form is invalid under XML 1.0, and will give rise to errors in XML parsers:
var doc = XDocument.Parse("<row A=\"This is BEL: &#x07;\" />");
// XmlException: ' ', hexadecimal value 0x07, is an invalid character. Line 1, position 25.
I would like to replace all these invalid numeric references with the Unicode replacement character, '�'. I know how to do this for unencoded XML:
string str = "<row A=\"This is BEL: \u0007\" />";
if (str.Any(c => !XmlConvert.IsXmlChar(c)))
str = new string(str.Select(c => XmlConvert.IsXmlChar(c) ? c : '�').ToArray());
// <row A="This is BEL: �" />
Is there a straightforward way to make it work for encoded XML too? I would prefer to avoid having to HtmlDecode then HtmlEncode the whole string, in order not to risk introducing changes other than invalid character replacement.
Edit: The conversion needs to be done in my C# code, not SQL, in order for it to be implemented centrally.
I made another go at it using regular expressions. This should handle both decimal and hex character codes. Also, this will not affect anything but numerically encoded characters.
public string ReplaceXMLEncodedCharacters(string input)
{
const string pattern = @"&#(x?)([A-Fa-f0-9]+);";
MatchCollection matches = Regex.Matches(input, pattern);
int offset = 0;
foreach (Match match in matches)
{
int charCode = 0;
if (string.IsNullOrEmpty(match.Groups[1].Value))
charCode = int.Parse(match.Groups[2].Value);
else
charCode = int.Parse(match.Groups[2].Value, System.Globalization.NumberStyles.HexNumber);
char character = (char)charCode;
input = input.Remove(match.Index - offset, match.Length).Insert(match.Index - offset, character.ToString());
offset += match.Length - 1;
}
return input;
}
You can wrap the special characters in the CDATA tag. This informs the parser to ignore text within the tag. To use your example:
SELECT 'This is BEL: <![CDATA[' + CHAR(7) + ']]>' AS A FOR XML RAW
This will allow the XML to be parsed at the very least, albeit requiring a slight change to the document structure.
For reference, this is my solution. I've built on Tonkleton's answer, but modified it to match the internal implementation of HtmlDecode more closely. The code below ignores surrogate pairs.
// numeric character references
static readonly Regex ncrRegex = new Regex("&#x?[A-Fa-f0-9]+;");
static string ReplaceInvalidXmlCharacterReferences(string input)
{
if (input.IndexOf("&#") == -1) // optimization
return input;
return ncrRegex.Replace(input, match =>
{
string ncr = match.Value;
uint num;
var frmt = NumberFormatInfo.InvariantInfo;
bool isParsed =
ncr[2] == 'x' ? // the x must be lowercase in XML documents
uint.TryParse(ncr.Substring(3, ncr.Length - 4), NumberStyles.AllowHexSpecifier, frmt, out num) :
uint.TryParse(ncr.Substring(2, ncr.Length - 3), NumberStyles.Integer, frmt, out num);
return isParsed && !XmlConvert.IsXmlChar((char)num) ? "�" : ncr;
});
}
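For readers who want something they can paste and run, here is a self-contained sketch in the same spirit. It is not the answer's exact code: the names are hypothetical, the regex is slightly stricter (decimal references may only contain digits), and references above U+FFFF are accepted as long as they are within the Unicode range.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;
using System.Xml;

public static class NcrSanitizer
{
    // Matches decimal (&#7;) and hexadecimal (&#x07;) character references.
    private static readonly Regex Ncr = new Regex(@"&#(x[0-9A-Fa-f]+|[0-9]+);");

    // Replaces references to characters that XML 1.0 forbids with U+FFFD
    // and leaves every other part of the input untouched.
    public static string Sanitize(string xml)
    {
        return Ncr.Replace(xml, m =>
        {
            string body = m.Groups[1].Value;
            uint code;
            bool parsed = body[0] == 'x'
                ? uint.TryParse(body.Substring(1), NumberStyles.AllowHexSpecifier, CultureInfo.InvariantCulture, out code)
                : uint.TryParse(body, NumberStyles.None, CultureInfo.InvariantCulture, out code);
            if (!parsed) return "\uFFFD"; // overflowing values cannot be valid

            bool valid = code <= 0xFFFF
                ? XmlConvert.IsXmlChar((char)code)
                : code <= 0x10FFFF;
            return valid ? m.Value : "\uFFFD";
        });
    }
}
```

A reference to BEL is replaced, while a reference to a tab, which XML 1.0 allows, is kept exactly as written.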

c# Remove string from XML

I have an xml that has several attributes and values such as follows:
<z:row ID="1"
Author="2;#Bruce, Banner"
Editor="1;#Bruce, Banner"
FileRef="1;#Reports/Pipeline Tracker Report.xltm"
FileDirRef="1;#Reports"
Last_x0020_Modified="1;#2014-04-04 12:05:56"
Created_x0020_Date="1;#2014-04-04 11:36:21"
File_x0020_Size="1;#311815"
/>
How can I remove the part of each attribute value from just after the opening " up to and including the #?
Original
'Author="2;#Bruce, Banner"'
Converted
'Author="Bruce, Banner"'
See if this helps.
private string FilterValue(string input)
{
// If the string does not contain #, return value
if (!input.Contains("#"))
return input;
// # does exist in the string so
// 1) find its location
// 2) Read everything from that point to the end of the string
// 3) Return the SubString value
var index = input.IndexOf("#", StringComparison.Ordinal) + 1;
return input.Substring(index, input.Length - index);
}
Something like this?
// Same logic as M Patel's answer.
// This one only fits if there are exactly three characters to remove (one digit, one semicolon and one hash sign).
// Otherwise, use M Patel's solution.
string CleanElement(string elem)
{
return elem.Substring(3, elem.Length - 3);
}
Or like this:
// slower I guess but still a solution
string CleanElement(string elem)
{
string[] strs = elem.Split('#');
strs[0] = "";
return string.Join("", strs);
}
You can use the string.Substring and string.IndexOf methods:
string value = node.Attributes["Author"].Value;
value = value.Substring(value.IndexOf('#') + 1);
I hope this is what you are looking for, assuming you are already reading your node from the XML document.
If you are new to reading XML in C#, I would recommend taking a look at the following MSDN link: https://msdn.microsoft.com/en-us/library/cc189056(v=vs.95).aspx
You can use a regex to search for your pattern and then call the Regex.Replace() method.
The regex might look like this: "\d;#".
It should work if the entry is 2;#Bruce, Banner!
string value= node.Attributes["Author"].Value;
var op = value.Split('#');
string name = op[1];
If additional # characters are expected in the value, then:
string value1 = value.Substring(3, value.Length - 3);
You can use a simple regex:
string s = #"<z:row ID=""1""
Author=""2;#Bruce, Banner""
Editor=""1;#Bruce, Banner""
FileRef=""1;#Reports/Pipeline Tracker Report.xltm""
FileDirRef=""1;#Reports""
Last_x0020_Modified=""1;#2014-04-04 12:05:56""
Created_x0020_Date=""1;#2014-04-04 11:36:21""
File_x0020_Size=""1;#311815""
/>";
// The match includes the opening quote, so the replacement puts it back.
string result = Regex.Replace(s, "\"([0-9];#)", "\"");
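A variant of the regex approach, sketched under the assumption that every prefix has the form "digits;#" directly after the opening quote of an attribute value. It uses a lookbehind so the quote itself is never part of the match; the class name is illustrative.

```csharp
using System.Text.RegularExpressions;

public static class LookupPrefixStripper
{
    // Removes "<digits>;#" only where it directly follows a double quote.
    public static string Strip(string xml)
    {
        return Regex.Replace(xml, @"(?<="")[0-9]+;#", "");
    }
}
```

Because the quote is only asserted and not consumed, the replacement string can stay empty, and multi-digit prefixes are handled as well.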

Losing the 'less than' sign in HtmlAgilityPack loadhtml

I recently started experimenting with the HtmlAgilityPack. I am not familiar with all of its options, and I think that is why I am doing something wrong.
I have a string with the following content:
string s = "<span style=\"color: #0000FF;\"><</span>";
You see that in my span I have a 'less than' sign.
I process this string with the following code:
HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(s);
But when I do a quick and dirty look in the span like this:
htmlDocument.DocumentNode.ChildNodes[0].InnerHtml
I see that the span is empty.
What option do I need to set to preserve the 'less than' sign? I already tried this:
htmlDocument.OptionAutoCloseOnEnd = false;
htmlDocument.OptionCheckSyntax = false;
htmlDocument.OptionFixNestedTags = false;
but with no success.
I know it is invalid HTML. I am using this to fix invalid HTML and apply HTMLEncode to the 'less than' signs.
Please point me in the right direction. Thanks in advance.
The Html Agility Pack detects this as an error and creates an HtmlParseError instance for it. You can read all errors using the ParseErrors property of the HtmlDocument class. So, if you run this code:
string s = "<span style=\"color: #0000FF;\"><</span>";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(s);
doc.Save(Console.Out);
Console.WriteLine();
Console.WriteLine();
foreach (HtmlParseError err in doc.ParseErrors)
{
Console.WriteLine("Error");
Console.WriteLine(" code=" + err.Code);
Console.WriteLine(" reason=" + err.Reason);
Console.WriteLine(" text=" + err.SourceText);
Console.WriteLine(" line=" + err.Line);
Console.WriteLine(" pos=" + err.StreamPosition);
Console.WriteLine(" col=" + err.LinePosition);
}
It will display this (the corrected text first, then the details of the error):
<span style="color: #0000FF;"></span>
Error
code=EndTagNotRequired
reason=End tag </> is not required
text=<
line=1
pos=30
col=31
So you can try to fix this error, as you have all required information (including line, column, and stream position) but the general process of fixing (not detecting) errors in HTML is very complex.
As mentioned in another answer, the best solution I found was to pre-parse the HTML to convert orphaned < symbols to their HTML-encoded value &lt;.
return Regex.Replace(html, "<(?![^<]+>)", "&lt;");
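Wrapped into a self-contained helper (the class and method names are made up for illustration), the pre-parse step can be sketched like this. The negative lookahead treats a < as orphaned when it is not followed by a run of non-< characters ending in >.

```csharp
using System.Text.RegularExpressions;

public static class HtmlPreParser
{
    // Encodes every '<' that does not open something that looks like a tag.
    public static string EscapeOrphanedLessThan(string html)
    {
        return Regex.Replace(html, "<(?![^<]+>)", "&lt;");
    }
}
```

This is a heuristic: text such as "a < b > c" still looks like a tag to the pattern and is left alone, so it is worth checking the output before handing it to HtmlAgilityPack.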
Fix the markup, because your HTML string is invalid:
string s = "<span style=\"color: #0000FF;\">&lt;</span>";
Although it is true that the given html is invalid, HtmlAgilityPack should still be able to parse it. It is not an uncommon mistake on the web to forget to encode "<", and if HtmlAgilityPack is used as a crawler, then it should anticipate bad html. I tested the example in IE, Chrome and Firefox, and they all show the extra < as text.
I wrote the following method that you can use to preprocess the HTML string and replace all 'unclosed' '<' characters with "&lt;":
static string PreProcess(string htmlInput)
{
// Stores the index of the last unclosed '<' character, or -1 if the last '<' character is closed.
int lastGt = -1;
// This list will be populated with all the unclosed '<' characters.
List<int> gtPositions = new List<int>();
// Collect the unclosed '<' characters.
for (int i = 0; i < htmlInput.Length; i++)
{
if (htmlInput[i] == '<')
{
if (lastGt != -1)
gtPositions.Add(lastGt);
lastGt = i;
}
else if (htmlInput[i] == '>')
lastGt = -1;
}
if (lastGt != -1)
gtPositions.Add(lastGt);
// If no unclosed '<' characters are found, then just return the input string.
if (gtPositions.Count == 0)
return htmlInput;
// Build the output string, replacing each unclosed '<' character with "&lt;".
StringBuilder htmlOutput = new StringBuilder(htmlInput.Length + 3 * gtPositions.Count);
int start = 0;
foreach (int gtPosition in gtPositions)
{
htmlOutput.Append(htmlInput.Substring(start, gtPosition - start));
htmlOutput.Append("&lt;");
start = gtPosition + 1;
}
htmlOutput.Append(htmlInput.Substring(start));
return htmlOutput.ToString();
}
The string "s" is bad HTML.
string s = "<span style=\"color: #0000FF;\">&lt;</span>";
This is the correct form.
