Best way to provide the user an escape string - c#

Suppose I want to ask a user what format they want a certain output to be in and the output will include fill-in fields. So they provide something like this string:
"Output text including some field {FieldName1Value} and another {FieldName2Value} and so on..."
Anything enclosed in {} should be a column name in a table somewhere, and the code I am writing will replace it with the stored value. That seems simple: I could just do a string.Replace on any instance that matches the pattern "{" + FieldName + "}". But what if I also want to give the user the option of an escape, so they can use brackets like any other character? I was thinking they would provide "{{" or "}}" to escape that bracket, which is nice and easy for them. So they could provide something like:
"Output text including some field {FieldName1Value} and another {FieldName2Value} but not this {{FieldName2Value}}"
But now "{{FieldName2Value}}" should be treated like any other string and ignored by the Replace. Also, if they decided to put something like "{{{FieldName2Value}}}" with triple brackets, the code should interpret that as the field value wrapped in brackets, and so on.
This is where I get stuck. I am trying with RegEx and came up with this:
public object Convert(object[] values, Type targetType, object parameter, CultureInfo culture)
{
string format = (string)values[0];
ObservableCollection<CalloutFieldAliasMap> oc = (ObservableCollection<CalloutFieldAliasMap>)values[1];
foreach (CalloutFieldAliasMap map in oc)
format = Regex.Replace(format, @"(?<!{){" + map.FieldName + "(?<!})}", " " + map.FieldAlias + " ", RegexOptions.IgnoreCase);
return format;
}
This works in the situation with double brackets {{ }} but NOT if there are three, i.e. {{{ }}}. The triple brackets are treated as a plain string when they should be treated as {FieldValue}.
Thanks for any help.

By expanding on your regular expression, the presence of literals can be accommodated.
format = Regex.Replace(format,
@"(?<!([^{]|^){(?:{{)*){" + Regex.Escape(map.FieldName) + "}",
String.Format(" {0} ", map.FieldAlias),
RegexOptions.IgnoreCase | RegexOptions.Compiled);
The first part of the expression, (?<!([^{]|^){(?:{{)*){, designates that the { must be preceded by an even number of { characters for it to mark the beginning of a field token. Thus, {FieldName} and {{{FieldName} will denote the start of a field name, whereas {{FieldName} and {{{{FieldName} would not.
The closing } simply requires that the field end with a single }. There is some ambiguity in the syntax in that {FieldName1Value}}} could be parsed as a token with FieldName1Value (followed by the literal }) or FieldName1Value}. The regex assumes the former. (If the latter is intended, you could replace this with }(?!}(}})*) instead.)
A couple of other notes. I added Regex.Escape(map.FieldName) so that all characters in the field name are treated as literals; and added the RegexOptions.Compiled flag. (Since this is both a complex expression and executed in a loop, it is a good candidate for compilation.)
After the loop executes, a simple:
format = format.Replace("{{", "{").Replace("}}", "}");
can be used to unescape the literal {{ and }} characters.

The simplest way would be to use String.Replace to replace the double brackets with a character sequence that the user cannot (or almost certainly will not) enter. Then do the replacement of your fields, and finally convert the placeholder sequences back to the double brackets.
For example, given:
string replaceOpen = "{x"; // 'x' should be something like \u00ff, for example
string replaceClose = "x}";
string template = "Replace {ThisField} but not {{ThatField}}";
string temp = template.Replace("{{", replaceOpen).Replace("}}", replaceClose);
string converted = temp.Replace("{ThisField}", "Foo");
string final = converted.Replace(replaceOpen, "{{").Replace(replaceClose, "}}");
It's not particularly pretty, but it's effective.
How you go about it is going to depend in large part on how often you call this, and how fast you really need it to be.

I have an extension method I wrote that almost does what you ask, but, while it does escape using double braces, it doesn't do the triple braces like you suggested. Here is the method (also on GitHub at https://github.com/benallred/Icing/blob/master/Icing/Icing.Core/StringExtensions.cs):
private const string FormatTokenGroupName = "token";
private static readonly Regex FormatRegex = new Regex(@"(?<!\{)\{(?<" + FormatTokenGroupName + @">\w+)\}(?!\})", RegexOptions.Compiled);
public static string Format(this string source, IDictionary<string, string> replacements)
{
if (string.IsNullOrWhiteSpace(source) || replacements == null)
{
return source;
}
string replaced = replacements.Aggregate(source,
(current, pair) =>
FormatRegex.Replace(current,
new MatchEvaluator(match =>
(match.Groups[FormatTokenGroupName].Value == pair.Key
? pair.Value : match.Value))));
return replaced.Replace("{{", "{").Replace("}}", "}");
}
Usage:
"This is my {FieldName}".Format(new Dictionary<string, string>() { { "FieldName", "value" } });
Even easier if you add this:
public static string Format(this string source, object replacements)
{
if (string.IsNullOrWhiteSpace(source) || replacements == null)
{
return source;
}
IDictionary<string, string> replacementsDictionary = new Dictionary<string, string>();
foreach (PropertyDescriptor propertyDescriptor in TypeDescriptor.GetProperties(replacements))
{
string token = propertyDescriptor.Name;
object value = propertyDescriptor.GetValue(replacements);
replacementsDictionary.Add(token, (value != null ? value.ToString() : String.Empty));
}
return Format(source, replacementsDictionary);
}
Usage:
"This is my {FieldName}".Format(new { FieldName = "value" });
Unit tests for this method are at https://github.com/benallred/Icing/blob/master/Icing/Icing.Tests/Core/TestOf_StringExtensions.cs
If this doesn't work, what would your ideal solution do for more than three braces? In other words, if {{{FieldName}}} becomes {value}, what does {{{{FieldName}}}} become? What about {{{{{FieldName}}}}} and so on? While those cases are unlikely, they still need to be handled purposefully.

RegEx will not do what you want because it only knows its current state and what transitions are available. It has no concept of memory. The language you're trying to parse is not regular, so you will never be able to write a RegEx to handle the general case. You would need i expressions, where i is the number of matching braces.
There is a lot of theory behind this, and I'll provide some links at the bottom if you're curious. But basically the language you're trying to parse is context-free, and to implement a general solution you'll need to model a pushdown automaton, which uses a stack to ensure that an opening brace has a matching closing brace (yes, this is why most languages have matching braces).
Each time you encounter { you push it onto the stack. If you encounter } you pop from the stack. When you empty the stack you will know that you've reached the end of a field. Of course that's a major simplification of the problem, but if you're looking for a general solution it should get you moving in the right direction.
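As a rough illustration of scanning the template rather than pattern matching, here is a minimal sketch. It is a plain left-to-right scan, not a full pushdown automaton: the field tokens in the question do not nest, so no stack is needed for that case. TemplateScanner, Expand and the fields dictionary are hypothetical names chosen for this example, and the sketch assumes field names never contain braces.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class TemplateScanner
{
    // Hypothetical helper: expands {Name} tokens and treats {{ and }} as literal braces.
    public static string Expand(string template, IDictionary<string, string> fields)
    {
        var output = new StringBuilder(template.Length);
        int i = 0;
        while (i < template.Length)
        {
            char c = template[i];
            bool doubled = i + 1 < template.Length && template[i + 1] == c;

            if ((c == '{' || c == '}') && doubled)
            {
                output.Append(c);           // escaped brace: emit one, consume two
                i += 2;
            }
            else if (c == '{')
            {
                int close = template.IndexOf('}', i + 1);
                string name = close < 0 ? null : template.Substring(i + 1, close - i - 1);
                string value;
                if (name != null && fields.TryGetValue(name, out value))
                {
                    output.Append(value);   // known field: emit its value
                    i = close + 1;
                }
                else
                {
                    output.Append(c);       // unknown or unterminated token: keep the brace as text
                    i++;
                }
            }
            else
            {
                output.Append(c);           // ordinary character
                i++;
            }
        }
        return output.ToString();
    }
}
```

With a dictionary that maps "FieldName" to "value", Expand("{{{FieldName}}}", fields) yields "{value}" and Expand("{{FieldName}}", fields) yields "{FieldName}". Passing a dictionary built with StringComparer.OrdinalIgnoreCase would give case-insensitive field names.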
http://en.wikipedia.org/wiki/Regular_language
http://en.wikipedia.org/wiki/Context-free_language
http://en.wikipedia.org/wiki/Pushdown_automaton

Related

String replace with indication if replaced in one line

I'm looking for an efficient, case-insensitive string replace. If using Regex, I don't want to call Regex.IsMatch and then Regex.Replace, because that is two searches through the input instead of one. I could do the following, but again this requires an additional local variable. Is there a way to do it in one line without a local variable? Something like Regex.TryReplace(ref string input, ...) that would return a bool.
string input = "string with pattern";
string replaced = Regex.Replace(input , Regex.Escape("pattern"), "replace value", RegexOptions.IgnoreCase);
if (!ReferenceEquals(replaced, input))
{
input = replaced;
// do something
}
You can do it with a try/catch using the Replace(String, String, String, RegexOptions, TimeSpan) overload.
try {
Console.WriteLine(Regex.Replace(words, pattern, evaluator,
RegexOptions.IgnorePatternWhitespace,
TimeSpan.FromSeconds(.25)));
}
catch (RegexMatchTimeoutException) {
Console.WriteLine("Returned words:");
}
Reference
But you are still performing two operations: trying to replace, and checking whether it was replaced, which you'll always be doing. I'm curious why doing two operations in one line is such a concern.
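For what it's worth, the Regex.TryReplace method the question asks about does not exist in the framework, but something close can be sketched with the MatchEvaluator overload of Regex.Replace, which makes a single pass over the input. The class and method names below are hypothetical, and this is only one possible shape for such a helper.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexHelpers
{
    // Hypothetical helper: replaces in one pass and reports whether anything matched.
    public static bool TryReplace(ref string input, string pattern, string replacement, RegexOptions options)
    {
        bool replaced = false;
        input = Regex.Replace(input, pattern, match =>
        {
            replaced = true;                    // the evaluator runs only when a match is found
            return match.Result(replacement);   // expands substitutions such as $1 in the replacement
        }, options);
        return replaced;
    }
}
```

Usage: if (RegexHelpers.TryReplace(ref input, Regex.Escape("pattern"), "replace value", RegexOptions.IgnoreCase)) { /* do something */ }. The flag is set inside the evaluator, so no second search and no comparison of the result against the original string is needed.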

Is it possible to store a regex match and use part of it as a list enumerator?

I have created a MadLibs style game where the user enters responses to prompts, which in turn replace blanks, represented by %s0, %s1 etc., in a story. I have this working using a for loop, but someone else suggested I could do it using regex. What I have so far is below, which replaces all instances of %s+number with "wibble". What I was wondering is whether it is possible to store the number found by the regex in a temporary variable and in turn use that to return a value from the list Words. E.g. return Regex.Replace(story, pattern, Globals.Words[x]); where x is the number returned by the regex pattern as it goes over the string.
static void Main(string[] args)
{
Globals.Words = new List<string>();
Globals.Words.Add("nathan");
Globals.Words.Add("bob");
var text = "Once upon a time there was a %s0 and it was %s1";
Console.WriteLine(FindEscapeCharacters(text));
}
public static string FindEscapeCharacters(string story)
{
var pattern = @"%s([0-9]+)";
return Regex.Replace(story, pattern, "wibble");
}
Thanks in advance, Nathan.
Not a direct answer to your question about regexes, but if I understand you correctly, there is an easier way to do this:
string baseString = "I have a {0} {1} in my {0} {2}.";
List<string> words = new List<string>() { "red", "cat", "hat" };
string outputString = String.Format(baseString, words.ToArray());
outputString will be "I have a red cat in my red hat."
Is that not what you want, or is there more to your question that I'm missing?
Minor elaboration
String.Format uses the following signature:
string Format(string format, params object[] values)
The neat thing about params is that you can either list values separately:
var a = String.Format("...", valueA, valueB, valueC);
but you can also pass in an array directly:
var a = String.Format("...", valueArray);
Note that you can't mix and match the two approaches.
Yes, you are very close in your attempt with Regex.Replace; the last step is to change constant "wibble" into lambda match => how_to_replace_the_match:
var text = "Once upon a time there was a %s0 and it was %s1";
// Once upon a time there was a nathan and it was bob
var result = Regex.Replace(
text,
"%s([0-9]+)",
match => Globals.Words[int.Parse(match.Groups[1].Value)]);
Edit: In case you don't want to work with capturing groups by their numbers, you can name them explicitly:
// Once upon a time there was a nathan and it was bob
var result = Regex.Replace(
text,
"%s(?<number>[0-9]+)",
match => Globals.Words[int.Parse(match.Groups["number"].Value)]);
There is an overload of Regex.Replace that, rather than taking a string for the last argument, takes a MatchEvaluator delegate - a function that takes a Match object and returns a string.
You could make that function parse the integer from the Match's Groups[1].Value property and then use that to index into your list, returning the string you find.

Sanitizing a String for a Property Name

Problem
I need to sanitize a collection of Strings from user input to a valid property name.
Context
We have a DataGrid that works with runtime generated classes. These classes are generated based on some parameters. Parameter names are converted into Properties. Some of these parameter names are from user input. We implemented this and it all seemed to work great. Our logic to sanitizing strings was to only allow numbers and letters and convert the rest to an X.
const string regexPattern = @"[^a-zA-Z0-9]";
return ("X" + Regex.Replace(input, regexPattern, "X")); //prefix with X in case the name starts with a number
The property names were always correct and we stored the original string in a dictionary so we could still show a user friendly parameter name.
However, where the trouble starts is when a string only differs in illegal characters like this:
Parameter Name
Parameter_Name
These were both converted into:
ParameterXName
A solution would be to just generate some safe, unrelated names like A, B, C, etc. But I would prefer the name to still be recognizable in debug, unless this behavior is too complicated to implement, of course.
I looked at other questions on StackOverflow, but they all seem to remove illegal characters, which has the same problem.
I feel like I'm reinventing the wheel. Is there some standard solution or trick for this?
I suggest changing the algorithm so that it generates names that are safe, distinct and still recognizable.
In C#, _ is a valid character in member names. Replace each invalid character (chr) not with X but with "_"+(short)chr+"_".
demo
using System;
using System.Linq;
public class Program
{
public static void Main()
{
string [] props = {"Parameter Name", "Parameter_Name"};
var validNames = props.Select(s=>Sanitize(s)).ToList();
Console.WriteLine(String.Join(Environment.NewLine, validNames));
}
private static string Sanitize(string s)
{
return String.Join("", s.AsEnumerable()
.Select(chr => Char.IsLetter(chr) || Char.IsDigit(chr)
? chr.ToString() // valid symbol
: "_"+(short)chr+"_") // numeric code for invalid symbol
);
}
}
prints
Parameter_32_Name
Parameter_95_Name

C# Regex, any more efficient way to parse string enclosed by symbol?

I'm not sure if it's okay to ask... But here goes.
I implemented a method that parses a string using regex. Each match is passed through a delegate, in a fixed order (actually, I think the order is not important... or is it? I wrote it this way anyway, and it's not fully tested):
Pattern Regex.Replace: @"(?<!\\)\$.+?\$" then String.Replace: @"\$", @"$"; Replaces a string enclosed by dollar signs. Ignores backslash-escaped ones, then erases the backslash. Ex: "$global name$" -> "motherofglobalvar", "Money \$9000" -> "Money $9000"
Pattern Regex.Replace @"(?<!\\)%.+?%" then String.Replace @"\%", @"%"; Replaces a string enclosed by percent signs. Ignores backslash-escaped ones, then erases the backslash. Same as the previous example: "%local var%" -> "lordoflocalvar", "It's over 9000\%" -> "It's over 9000%"
Pattern Regex.Replace @"(?<!\\)#" then String.Replace @"\#", @"#"; Replaces the char '#' with whitespace, ' '. Ignores backslash-escaped ones, then erases the backslash. Ex: "I#hit#the#ground#too#hard" -> "I hit the ground too hard", "qw\#op" -> "qw#op"
What I've done without much experience (I think):
//parse variable
public static string ParseVariable(string text)
{
return Regex.Replace(Regex.Replace(Regex.Replace(text, @"(?<!\\)\$.+?\$", match =>
{
string trim = match.Value.Trim('$');
string trimUpper = trim.ToUpper();
return variableGlobal.ContainsKey(trim) ? variableGlobal[trim] : match.Value;
}).Replace(@"\$", @"$"), @"(?<!\\)%.+?%", match =>
{
string trim = match.Value.Trim('%');
string trimUpper = trim.ToUpper();
return variableLocal.ContainsKey(trim) ? variableLocal[trim] : match.Value;
}).Replace(@"\%", @"%"), @"(?<!\\)#", " ").Replace(@"\#", @"#");
}
In short, what I used is: Regex.Replace().Replace()
Since I need to parse 3 kinds of symbols, I chained it as following: Regex.Replace(Regex.Replace(Regex.Replace().Replace()).Replace()).Replace()
Is there any more efficient way than this? I mean, like without need to go through the text 6 times? (3 times regex.replace, 3 times string.replace, where each replace modifies the text to be used by the next replace )
Or is this the best that can be done?
Thanks.
Here's a unique take on the problem, I think. You can build a class that will be used to construct the overall pattern piece by piece. This class will also be responsible for generating the MatchEvaluator delegate that will be passed to Replace.
class RegexReplacer
{
public string Pattern { get; private set; }
public string Replacement { get; private set; }
public string GroupName { get; private set; }
public RegexReplacer NextReplacer { get; private set; }
public RegexReplacer(string pattern, string replacement, string groupName, RegexReplacer nextReplacer = null)
{
this.Pattern = pattern;
this.Replacement = replacement;
this.GroupName = groupName;
this.NextReplacer = nextReplacer;
}
public string GetAggregatedPattern()
{
string constructedPattern = this.Pattern;
string alternation = (this.NextReplacer == null ? string.Empty : "|" + this.NextReplacer.GetAggregatedPattern()); // If there isn't another replacer, then we won't have an alternation; otherwise, we build an alternation between this pattern and the next replacer's "full" pattern
constructedPattern = string.Format("(?<{0}>{1}){2}", this.GroupName, this.Pattern, alternation); // The (?<XXX>) syntax builds a named capture group. This is used by our GetReplaceDelegate method.
return constructedPattern;
}
public MatchEvaluator GetReplaceDelegate()
{
return (match) =>
{
if (match.Groups[this.GroupName] != null && match.Groups[this.GroupName].Length > 0) // Did we get a hit on the group name?
{
return this.Replacement;
}
else if (this.NextReplacer != null) // No? Then is there another replacer to inspect?
{
MatchEvaluator next = this.NextReplacer.GetReplaceDelegate();
return next(match);
}
else
{
return match.Value; // No? Then simply return the value
}
};
}
}
It should be obvious as to what Pattern and Replacement represent. GroupName is kind of a hack to let the replacement evaluator know which RegexReplacer fragment resulted in the match. NextReplacer points to another replacer instance that holds a different pattern fragment (et al.).
The idea here is to have a kind of linked list of objects that will represent the overall pattern. You can call GetAggregatedPattern on the outer-most replacer to get the full pattern: each replacer calls the next replacer's GetAggregatedPattern to get that replacer's pattern fragment, to which it concatenates its own fragment. GetReplaceDelegate generates a MatchEvaluator. This MatchEvaluator will compare its own GroupName to the Match's captured groups. If the group name was captured, then we have a hit, and we return this replacer's Replacement value. Otherwise, we step into the next replacer (if there is one) and repeat the group name comparison. If there is no hit on any replacer, then we simply yield back the original value (i.e. what was matched by the pattern; this should be rare).
The usage of such might look like this:
string target = @"$global name$ Money \$9000 %local var% It's over 9000\% I#hit#the#ground#too#hard qw\#op";
RegexReplacer dollarWrapped = new RegexReplacer(@"(?<!\\)\$[^$]+\$", "motherofglobalvar", "dollarWrapped");
RegexReplacer slashDollar = new RegexReplacer(@"\\\$", "$", "slashDollar", dollarWrapped); // "\$" becomes "$", as in the question's example
RegexReplacer percentWrapped = new RegexReplacer(@"(?<!\\)%[^%]+%", "lordoflocalvar", "percentWrapped", slashDollar);
RegexReplacer slashPercent = new RegexReplacer(@"\\%", "%", "slashPercent", percentWrapped); // "\%" becomes "%"
RegexReplacer singleAt = new RegexReplacer(@"(?<!\\)#", " ", "singleAt", slashPercent);
RegexReplacer slashAt = new RegexReplacer(@"\\#", "#", "slashAt", singleAt);
RegexReplacer replacer = slashAt;
string pattern = replacer.GetAggregatedPattern();
MatchEvaluator evaluator = replacer.GetReplaceDelegate();
string result = Regex.Replace(target, pattern, evaluator);
Because you want each replacer to know if it got a hit, and because we are hacking this by using group names, you want to make sure that each group name is distinct. A simple way to ensure this would be to use a name that's identical to the variable name since you can't have two variables with the same name within the same scope.
You can see above that I am building each part of the pattern separately, but as I build, I pass the previous replacer as a 4th parameter to the current replacer. This builds the chain of replacers. Once built, I use the last replacer constructed in order to generate the overall pattern and evaluator. If you use anything but, then you will only have part of the overall pattern. Finally, it's simply a matter of passing the generated pattern and evaluator to the Replace method.
Keep in mind that this approach was targeted more at the problem as described. It may work in more general scenarios, but I've only worked with what you've presented. Also, since this is more of a parsing question, a parser may be the proper route to take--although the learning curve is going to be higher.
Also keep in mind that I haven't profiled this code. It certainly doesn't loop over the target string multiple times, but it does involve additional method calls during replacement. You would certainly want to test it in your environment.

How to validate that a string doesn't contain HTML using C#

Does anyone have a simple, efficient way of checking that a string doesn't contain HTML? Basically, I want to check that certain fields only contain plain text. I thought about looking for the < character, but that can easily be used in plain text. Another way might be to create a new System.Xml.Linq.XElement using:
XElement.Parse("<wrapper>" + MyString + "</wrapper>")
and check that the XElement contains no child elements, but this seems a little heavyweight for what I need.
The following will match any matching pair of tags, e.g. <b>this</b>
Regex tagRegex = new Regex(@"<\s*([^ >]+)[^>]*>.*?<\s*/\s*\1\s*>");
The following will match any single tag, e.g. <b> (it doesn't have to be closed).
Regex tagRegex = new Regex(@"<[^>]+>");
You can then use it like so
bool hasTags = tagRegex.IsMatch(myString);
You could ensure plain text by encoding the input using HttpUtility.HtmlEncode.
In fact, depending on how strict you want the check to be, you could use it to determine if the string contains HTML:
bool containsHTML = (myString != HttpUtility.HtmlEncode(myString));
Here you go:
using System.Text.RegularExpressions;
private bool ContainsHTML(string checkString)
{
return Regex.IsMatch(checkString, "<(.|\n)*?>");
}
That is the simplest way, since items in brackets are unlikely to occur naturally.
I just tried my XElement.Parse solution. I created an extension method on the string class so I can reuse the code easily:
public static bool ContainsXHTML(this string input)
{
try
{
XElement x = XElement.Parse("<wrapper>" + input + "</wrapper>");
return !(x.DescendantNodes().Count() == 1 && x.DescendantNodes().First().NodeType == XmlNodeType.Text);
}
catch (XmlException ex)
{
return true;
}
}
One problem I found was that plain text ampersand and less than characters cause an XmlException and indicate that the field contains HTML (which is wrong). To fix this, the input string passed in first needs to have the ampersands and less than characters converted to their equivalent XHTML entities. I wrote another extension method to do that:
public static string ConvertXHTMLEntities(this string input)
{
// Convert all ampersands to the ampersand entity.
string output = input;
output = output.Replace("&amp;", "amp_token");
output = output.Replace("&", "&amp;");
output = output.Replace("amp_token", "&amp;");
// Convert less than to the less than entity (without messing up tags).
output = output.Replace("< ", "&lt; ");
return output;
}
Now I can take a user submitted string and check that it doesn't contain HTML using the following code:
bool ContainsHTML = UserEnteredString.ConvertXHTMLEntities().ContainsXHTML();
I'm not sure if this is bulletproof, but I think it's good enough for my situation.
This also checks for things like < br /> self-closing tags with optional whitespace. The list does not contain the new HTML5 tags.
internal static class HtmlExts
{
public static bool containsHtmlTag(this string text, string tag)
{
var pattern = @"<\s*" + tag + @"\s*\/?>";
return Regex.IsMatch(text, pattern, RegexOptions.IgnoreCase);
}
public static bool containsHtmlTags(this string text, string tags)
{
var ba = tags.Split('|').Select(x => new {tag = x, hastag = text.containsHtmlTag(x)}).Where(x => x.hastag);
return ba.Count() > 0;
}
public static bool containsHtmlTags(this string text)
{
return
text.containsHtmlTags(
"a|abbr|acronym|address|area|b|base|bdo|big|blockquote|body|br|button|caption|cite|code|col|colgroup|dd|del|dfn|div|dl|DOCTYPE|dt|em|fieldset|form|h1|h2|h3|h4|h5|h6|head|html|hr|i|img|input|ins|kbd|label|legend|li|link|map|meta|noscript|object|ol|optgroup|option|p|param|pre|q|samp|script|select|small|span|strong|style|sub|sup|table|tbody|td|textarea|tfoot|th|thead|title|tr|tt|ul|var");
}
}
Angle brackets may not be your only challenge. Other characters can also be used for script injection, such as the common double hyphen "--", which can also be used in SQL injection. And there are others.
On an ASP.Net page, if validateRequest = true in machine.config, web.config or the page directive, the user will get an error page stating "A potentially dangerous Request.Form value was detected from the client" if an HTML tag or various other potential script-injection attacks are detected. You probably want to avoid this and provide a more elegant, less-scary UI experience.
You could test for both the opening and closing tags <> using a regular expression, and allow the text if only one of them occurs. Allow < or >, but not < followed by some text and then >, in that order.
You could allow angle brackets and HtmlEncode the text to preserve them when the data is persisted.
Beware when using the HttpUtility.HtmlEncode method mentioned above. If you are checking some text with special characters, but not HTML, it will evaluate incorrectly. Maybe that's why J c used "...depending on how strict you want the check to be..."
