Replacing special characters with their codes - C#

I am passing XML data from a text box to a server, and the XML breaks on symbols like &, < and |. I want to replace these symbols with their equivalent character codes.
If I use string.Replace, it also replaces characters that an earlier replacement just inserted:
.Replace("&", "&#38;")
.Replace("<", "&#60;")
.Replace("|", "&#124;")
.Replace("!", "&#33;")
.Replace("#", "&#35;")
Each call goes through the complete string again.
So instead of &<# becoming "&#38;&#60;&#35;", the # characters inserted by the earlier replacements are replaced again by the later ones.
I also tried the Dictionary approach:
var replacements = new Dictionary<string, string>
{
    {"&", "&#38;"},
    {"<", "&#60;"},
    {"|", "&#124;"},
    {"!", "&#33;"},
    {"#", "&#35;"}
};
var output = replacements.Aggregate(input, (current, replacement) => current.Replace(replacement.Key, replacement.Value));
return output;
But it has the same issue. I also tried a StringBuilder approach, with the same repeated-replacement problem. Any advice?

You shouldn't be trying to escape characters manually. There are library methods already built to do this, such as SecurityElement.Escape(). It escapes the characters that are invalid in XML content (<, >, &, ' and ") into a known safe format that can be unescaped later.
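As a minimal sketch of that call (the EscapeDemo class and EscapeForXml wrapper are names made up for this example, not part of the framework), this is roughly what it looks like in use. Note that |, ! and # are left alone, because they are already valid in XML text:

```csharp
using System;
using System.Security;

static class EscapeDemo
{
    // Thin wrapper so the call is easy to reuse; the name is made up for this example.
    public static string EscapeForXml(string raw)
    {
        // Replaces <, >, &, ' and " with &lt;, &gt;, &amp;, &apos; and &quot;.
        return SecurityElement.Escape(raw);
    }

    static void Main()
    {
        Console.WriteLine(EscapeForXml("&'<crazyMessage&&"));
        // prints: &amp;&apos;&lt;crazyMessage&amp;&amp;
    }
}
```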

I strongly advise using proper XML handling to build XML:
var id = 3;
var message = "&'<crazyMessage&&";
var xmlDoc = new XmlDocument();
using (var writer = xmlDoc.CreateNavigator().AppendChild())
{
    writer.WriteStartElement("ROOT");
    writer.WriteElementString("ID", id.ToString());
    writer.WriteStartElement("INPUT");
    writer.WriteElementString("ENGMSG", message);
    writer.WriteEndElement(); // INPUT
    writer.WriteEndElement(); // ROOT
}
var xmlString = xmlDoc.InnerXml;
Console.WriteLine(xmlString);
If you are using .NET 3.5 or higher, you can use LINQ to XML to build the XML, which is a bit cleaner:
var id = 3;
var message = "&'<crazyMessage&&";
var xml = new XElement("ROOT",
    new XElement("ID", id),
    new XElement("INPUT",
        new XElement("ENGMSG", message)
    )
);
var xmlString = xml.ToString();
Console.WriteLine(xmlString);

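// Requires: using System.Collections.Generic; using System.Text;
// Walks the input once, so text produced by a replacement is never scanned again.
public static string Transform(string input, Dictionary<string, string> replacements)
{
    var finalString = new StringBuilder(input.Length);
    foreach (char c in input)
    {
        string replacement;
        // Keys are expected to be single-character strings.
        if (replacements.TryGetValue(c.ToString(), out replacement))
        {
            finalString.Append(replacement);
        }
        else
        {
            finalString.Append(c);
        }
    }
    return finalString.ToString();
}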
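The same single-pass idea can also be sketched with Regex.Replace and a match evaluator, which additionally copes with keys longer than one character. This is only an illustration: the SinglePass class and ReplaceAll method are made-up names, and it assumes a non-empty dictionary in which no key is a prefix of another.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class SinglePass
{
    public static string ReplaceAll(string input, IDictionary<string, string> replacements)
    {
        // Build one alternation such as "&|<|\||!|\#"; Regex.Escape protects metacharacters.
        string pattern = string.Join("|", replacements.Keys.Select(k => Regex.Escape(k)));
        // Each match is looked up once; the inserted text is never rescanned.
        return Regex.Replace(input, pattern, m => replacements[m.Value]);
    }

    static void Main()
    {
        var replacements = new Dictionary<string, string>
        {
            {"&", "&#38;"}, {"<", "&#60;"}, {"|", "&#124;"}, {"!", "&#33;"}, {"#", "&#35;"}
        };
        Console.WriteLine(ReplaceAll("&<#", replacements)); // prints: &#38;&#60;&#35;
    }
}
```

Because the regex engine continues after the end of each match, the # inside an inserted "&#38;" is never seen as input, which is exactly what the chained Replace calls could not guarantee.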

Related

Linq Read XML with <br> tag

I have an XML file with the following structure:
<template><body>public DiffSectionType Type<template:br/>{<template:br/><template:tab/>get<template:br/><template:tab/>{<template:br/><template:tab/><template:tab/>return _Type;<template:br/><template:tab/>}<template:br/>}</body></template>
I would like it to be more readable, like:
public DiffSectionType Type
{
    get
    {
        return _Type;
    }
}
<template:br/> => new line
<template:tab/> => tab
I can read the body string, but I am not able to put it in the correct format.
I have tried:
var document = XDocument.Load("template.xml");
var body = from element in document.Elements("template").Elements("body")
           select element;
foreach (var v in body)
{
    Console.WriteLine(v.Value);
}
You could use Regex to solve this, with something like this:
string str = @"<template><body>public DiffSectionType Type<template:br/>{<template:br/><template:tab/>get<template:br/><template:tab/>{<template:br/><template:tab/><template:tab/>return _Type;<template:br/><template:tab/>}<template:br/>}</body></template>";
str = Regex.Replace(str, "<template:br\x2F>", Environment.NewLine);
str = Regex.Replace(str, "<template:tab\x2F>", "\t");
str = Regex.Replace(str, "(<\x2Ftemplate>)|(<template>)", "");
str = Regex.Replace(str, "(<\x2Fbody>)|(<body>)", "");

Extract text from <1></1> (HTML/XML-Like but with Number Tag)

So I have a long string containing pointy brackets that I wish to extract text parts from.
string exampleString = "<1>text1</1><27>text27</27><3>text3</3>";
I want to be able to get this
1 = "text1"
27 = "text27"
3 = "text3"
How would I obtain this easily? I haven't been able to come up with a non-hacky way to do it.
Thanks.
Using a basic XmlReader and a wrapping trick to turn the input into XML-like data, I would do something like this:
string xmlString = "<1>text1</1><27>text27</27><3>text3</3>";
xmlString = "<Root>" + xmlString.Replace("<", "<o").Replace("<o/", "</o") + "</Root>";
string key = "";
List<KeyValuePair<string,string>> kvpList = new List<KeyValuePair<string,string>>(); //assuming the result is in the KVP format
using (XmlReader xmlReader = XmlReader.Create(new StringReader(xmlString))) {
    bool firstElement = true;
    while (xmlReader.Read()) {
        if (firstElement) { //throwing away root
            firstElement = false;
            continue;
        }
        if (xmlReader.NodeType == XmlNodeType.Element) {
            key = xmlReader.Name.Substring(1); //cut off "o"
        } else if (xmlReader.NodeType == XmlNodeType.Text) {
            kvpList.Add(new KeyValuePair<string,string>(key, xmlReader.Value));
        }
    }
}
Edit:
The main trick is this line:
xmlString = "<Root>" + xmlString.Replace("<", "<o").Replace("<o/", "</o") + "</Root>"; //wrap so that there is a single root; the "o" forces every tag name to start with a known letter (comment edit suggested by Mr. chwarr)
First, every opening pointy bracket is replaced with itself plus a character, i.e.
<1>text1</1> -> <o1>text1<o/1> //first replacement, fixes the number issue
Then every sequence of opening pointy bracket + character + forward slash is reordered to opening pointy bracket + forward slash + character:
<o1>text1<o/1> -> <o1>text1</o1> //second replacement, fixes the closing tag issue
Using a simple WinForms app with a RichTextBox to print out the result,
for (int i = 0; i < kvpList.Count; ++i) {
    richTextBox1.AppendText(kvpList[i].Key + " = " + kvpList[i].Value + "\n");
}
Here is the result I get:
1 = text1
27 = text27
3 = text3
This is far from bulletproof, but you could use a combination of split and Regex matching:
string exampleString = "<1>text1</1><27>text27</27><3>text3</3>";
string[] results = exampleString.Split(new string[] { "><" }, StringSplitOptions.None);
Regex r = new Regex(@"^<?(\d+)>([^<]+)<");
foreach (string result in results)
{
    Match m = r.Match(result);
    if (m.Success)
    {
        string index = m.Groups[1].Value;
        string value = m.Groups[2].Value;
    }
}
The most obvious way to break this is text that itself contains a "<"; that would pretty much defeat both the split and the regex.
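Another option, sketched here under the assumption that tags are never nested with the same number, is a single regex with a backreference so that each closing tag must repeat the number of its opening tag. The NumberTags class and Extract method are names invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class NumberTags
{
    public static List<KeyValuePair<string, string>> Extract(string input)
    {
        var result = new List<KeyValuePair<string, string>>();
        // (\d+) captures the tag number; \1 requires the closing tag to repeat it.
        foreach (Match m in Regex.Matches(input, @"<(\d+)>(.*?)</\1>", RegexOptions.Singleline))
        {
            result.Add(new KeyValuePair<string, string>(m.Groups[1].Value, m.Groups[2].Value));
        }
        return result;
    }

    static void Main()
    {
        foreach (var pair in Extract("<1>text1</1><27>text27</27><3>text3</3>"))
        {
            Console.WriteLine(pair.Key + " = " + pair.Value);
        }
    }
}
```

Since the lazy group stops only at the matching closing tag, a stray "<" inside the text does not break the extraction the way it does with the split-based version.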

Facebook feed - remove extra Facebook JS from anchor

Please help me remove all the additional Facebook information from the anchor below using the C# .NET Regex.Replace method.
Example
http://on.fb.me/OE6gnBsomehtml
Output
somehtml on.fb.me/OE6gnB somehtml
I tried the following regex, but it didn't work for me:
searchPattern = "<a([.]*)?/l.php([.]*)?(\">)?([.]*)?(</a>)?";
replacePattern = "$3";
Thanks
I managed to do this using a regex with the following code:
searchPattern = "<a(.*?)href=\"/l.php...(.*?)&?(.*?)>(.*?)</a>";
string html1 = Regex.Replace(html, searchPattern, delegate(Match oMatch)
{
return string.Format("{1}", HttpUtility.UrlDecode(oMatch.Groups[2].Value), oMatch.Groups[4].Value);
});
You can try this (System.Web has to be added to use System.Web.HttpUtility):
string input = @"http://on.fb.me/OE6gnBsomehtml";
string rootedInput = String.Format("<root>{0}</root>", input);
XDocument doc = XDocument.Parse(rootedInput, LoadOptions.PreserveWhitespace);
string href;
var anchors = doc.Descendants("a").ToArray();
for (int i = anchors.Count() - 1; i >= 0; i--)
{
    href = HttpUtility.ParseQueryString(anchors[i].Attribute("href").Value)[0];
    XElement newAnchor = new XElement("a");
    newAnchor.SetAttributeValue("href", href);
    newAnchor.SetValue(href.Replace(@"http://", String.Empty));
    anchors[i].ReplaceWith(newAnchor);
}
string output = doc.Root.ToString(SaveOptions.DisableFormatting)
    .Replace("<root>", String.Empty)
    .Replace("</root>", String.Empty);

Removing style from a string retrieved from WordDocument with Open XML Office SDK

I'm searching for strings inside a Word document using the Open XML Office SDK 2.0 and listing them.
MatchCollection Matches;
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(txtLocation.Text, true))
{
    string docText = null;
    using (StreamReader sr = new StreamReader(wordDoc.MainDocumentPart.GetStream()))
    {
        docText = sr.ReadToEnd();
    }
    Regex regex = new Regex(@"\(.*?\)");
    Matches = regex.Matches(docText);
}
int i = 0;
while (i < Matches.Count)
{
    Label lb = new Label();
    lb.Text = Matches[i].ToString();
    lb.Location = new System.Drawing.Point(24, (28 + i * 24));
    this.panel1.Controls.Add(lb);
    i++;
}
The problem is that sometimes it returns the right string, e.g. (HelloWorld), but sometimes it's something totally different, with tags like: < w:rFonts w:ascii="Arial" w:hAnsi="Arial" w:cs="Arial"/ >
How do I get rid of those?
I found out what I had to do: run the string through another Regex.Replace.
This one removes all <> tags (so XML/HTML):
String str = Matches[i].ToString();
str = Regex.Replace(str, @"<(.|\n)*?>", "");
lb.Text = str;
Presumably, all the formatting tags are in XML style (between angle brackets). In that case, you can tell if a string is an XML tag using the String.StartsWith and String.EndsWith methods:
// ...
while (i < Matches.Count)
{
    String str = Matches[i].ToString();
    if (!(str.StartsWith("<") && str.EndsWith(">"))) {
        // ...
    }
    i++;
}
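The tags show up because the regex runs over the raw document XML, where a phrase such as (HelloWorld) can be split across several runs with formatting elements in between. A rough sketch of the alternative order, stripping tags first and matching afterwards, is below. The ParenFinder class, the FindParenthesized method and the XML fragment are all made up for illustration, and the sketch ignores details such as XML entities and paragraph boundaries.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ParenFinder
{
    public static List<string> FindParenthesized(string docXml)
    {
        // Drop every <...> tag first so only the visible text remains.
        string textOnly = Regex.Replace(docXml, @"<[^>]+>", "");
        var found = new List<string>();
        foreach (Match m in Regex.Matches(textOnly, @"\(.*?\)"))
        {
            found.Add(m.Value);
        }
        return found;
    }

    static void Main()
    {
        // Made-up fragment: "(HelloWorld)" is split across two runs with formatting in between.
        string docXml = "<w:p><w:r><w:t>(Hello</w:t></w:r>" +
                        "<w:r><w:rPr><w:rFonts w:ascii=\"Arial\"/></w:rPr><w:t>World)</w:t></w:r></w:p>";
        foreach (string s in FindParenthesized(docXml))
        {
            Console.WriteLine(s); // prints: (HelloWorld)
        }
    }
}
```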

Saving an XML that has invalid characters

There are code snippets that strip the invalid characters from a string before we save it as XML, but I have one more problem. Let's say my user wants a column name like "[MyColumnOne]". I do not want to strip the "[" and "]", because the user defined them and wants to see them. The code I use to strip invalid characters also removes "[" and "]", but in this case I still need them to be saved. What can I do?
Never mind, I changed my regex pattern to use XML 1.1 instead of XML 1.0 and now it is working well:
string pattern = String.Empty;
//pattern = @"#x((10?|[2-F])FFF[EF]|FDD[0-9A-F]|7F|8[0-46-9A-F]9[0-9A-F])"; //XML 1.0
pattern = @"#x((10?|[2-F])FFF[EF]|FDD[0-9A-F]|[19][0-9A-F]|7F|8[0-46-9A-F]|0?[1-8BCEF])"; // XML 1.1
Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
if (regex.IsMatch(sString))
{
    sString = regex.Replace(sString, String.Empty);
    // The first argument is the destination file path. outputPath is a placeholder name: the original passed sString here, which would use the content as the file name.
    File.WriteAllText(outputPath, sString, Encoding.UTF8);
}
return sString;
This worked for me, and it was fast.
private object NormalizeString(object p) {
    object result = p;
    if (p is string || p is long) {
        string s = string.Format("{0}", p);
        string resultString = s.Trim();
        if (string.IsNullOrWhiteSpace(resultString)) return "";
        Regex rxInvalidChars = new Regex("[\r\n\t]+", RegexOptions.IgnoreCase);
        if (rxInvalidChars.IsMatch(resultString)) {
            resultString = rxInvalidChars.Replace(resultString, " ");
        }
        //string pattern = String.Empty;
        //pattern = @"";
        ////pattern = @"#x((10?|[2-F])FFF[EF]|FDD[0-9A-F]|7F|8[0-46-9A-F]9[0-9A-F])"; //XML 1.0
        ////pattern = @"#x((10?|[2-F])FFF[EF]|FDD[0-9A-F]|[19][0-9A-F]|7F|8[0-46-9A-F]|0?[1-8BCEF])"; // XML 1.1
        //Regex rxInvalidXMLChars = new Regex(pattern, RegexOptions.IgnoreCase);
        //if (rxInvalidXMLChars.IsMatch(resultString)) {
        //    resultString = rxInvalidXMLChars.Replace(resultString, "");
        //}
        result = string.Join("", resultString.Where(c => c >= ' '));
    }
    return result;
}
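If the goal is only to drop characters that XML 1.0 does not allow while keeping user-defined ones such as "[" and "]", a filter based on XmlConvert.IsXmlChar (available since .NET 4.0) may be enough. This is a sketch, not a drop-in replacement: XmlCharFilter and RemoveInvalidXmlChars are made-up names, and characters outside the Basic Multilingual Plane would need extra handling because each surrogate half is rejected on its own.

```csharp
using System;
using System.Linq;
using System.Xml;

static class XmlCharFilter
{
    public static string RemoveInvalidXmlChars(string text)
    {
        // XmlConvert.IsXmlChar is true for characters allowed by the XML 1.0 Char production.
        return new string(text.Where(c => XmlConvert.IsXmlChar(c)).ToArray());
    }

    static void Main()
    {
        string columnName = "[MyColumnOne]\u0001\u000B";
        Console.WriteLine(RemoveInvalidXmlChars(columnName)); // prints: [MyColumnOne]
    }
}
```

Tab, line feed and carriage return are valid XML characters, so unlike the method above this filter leaves them in place.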
