Sanitize XML Attribute Values - c#

How can I easily sanitize the values I pass into the Value property of an XAttribute?

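Here's an extension method that strips the offending characters. \0 is not allowed in XML, and XML 1.0 also rejects most other control characters below ' ' (tab, line feed, and carriage return are the exceptions), so the filter keeps only characters from ' ' upward. Note that it also removes tabs and line breaks.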
void Main()
{
    Console.WriteLine("123\0394".XmlSanitize());
}
public static class XmlHelpers
{
    public static string XmlSanitize(this string badString)
    {
        // Keep only characters at or above the space character.
        return new String(badString.Where(c => c >= ' ').ToArray());
    }
}
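If tabs and line breaks should survive, a variant of the filter above can follow the XML 1.0 Char production instead of a plain "at or above space" test. This is only a sketch: the class and method names are made up for illustration, and it assumes XML 1.0 rules (not XML 1.1).

```csharp
using System.Text;

public static class XmlCharFilter
{
    // True for single UTF-16 code units allowed by the XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
    private static bool IsAllowedBmpChar(char c)
    {
        return c == '\t' || c == '\n' || c == '\r'
            || (c >= '\u0020' && c <= '\uD7FF')
            || (c >= '\uE000' && c <= '\uFFFD');
    }

    // Hypothetical helper: drops characters that XML 1.0 does not allow.
    public static string RemoveInvalidXmlChars(this string text)
    {
        if (string.IsNullOrEmpty(text)) return text;

        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (i + 1 < text.Length && char.IsSurrogatePair(c, text[i + 1]))
            {
                // A valid surrogate pair encodes a code point in #x10000-#x10FFFF.
                sb.Append(c).Append(text[i + 1]);
                i++;
            }
            else if (IsAllowedBmpChar(c))
            {
                sb.Append(c);
            }
            // Anything else (control characters, lone surrogates, U+FFFE, U+FFFF) is dropped.
        }
        return sb.ToString();
    }
}
```

Called as "123\0394".RemoveInvalidXmlChars(), this removes the null character but leaves a tab or newline in place.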

You could try:
string value = @"!@#$%^&*()123%^&*(!@#\(*!&10987";
value = System.Security.SecurityElement.Escape(value);
This replaces the characters <, >, &, " and ' with their XML entities. It does not remove characters such as \0 that are invalid in XML altogether.
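As a small illustration of what SecurityElement.Escape does to a value, here is a sketch; the wrapper class and method name are hypothetical and exist only to make the call easy to try out.

```csharp
using System.Security;

public static class EscapeDemo
{
    // Thin wrapper around SecurityElement.Escape, which replaces
    // <, >, &, " and ' with &lt; &gt; &amp; &quot; and &apos;.
    public static string EscapeForXml(string value)
    {
        return SecurityElement.Escape(value);
    }
}
```

For example, EscapeDemo.EscapeForXml("a<b & \"c\"") yields a&lt;b &amp; &quot;c&quot;. As far as I know, LINQ to XML already escapes these characters itself when an XAttribute is written out, so pre-escaping the value may lead to doubled entities; that is worth checking for your case.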

Related

Regex for string without special characters or spaces [duplicate]

How do I check a string to make sure it contains numbers, letters, or space only?
In C# this is simple:
private bool HasSpecialChars(string yourString)
{
return yourString.Any(ch => ! char.IsLetterOrDigit(ch));
}
The easiest way is to use a regular expression:
Regular Expression for alphanumeric and underscores
Using regular expressions in .net:
http://www.regular-expressions.info/dotnet.html
MSDN Regular Expression
Regex.IsMatch
var regexItem = new Regex("^[a-zA-Z0-9 ]*$");
if(regexItem.IsMatch(YOUR_STRING)){..}
string s = @"$KUH% I*$)OFNlkfn$";
var withoutSpecial = new string(s.Where(c => Char.IsLetterOrDigit(c)
    || Char.IsWhiteSpace(c)).ToArray());
if (s != withoutSpecial)
{
    Console.WriteLine("String contains special chars");
}
Try this way.
public static bool hasSpecialChar(string input)
{
    string specialChar = @"\|!#$%&/()=?»«@£§€{}.-;'<>_,";
    foreach (var item in specialChar)
    {
        if (input.Contains(item)) return true;
    }
    return false;
}
String test_string = "tesintg#$234524##";
if (System.Text.RegularExpressions.Regex.IsMatch(test_string, "^[a-zA-Z0-9\x20]+$"))
{
// Good-to-go
}
An example can be found here: http://ideone.com/B1HxA
If the list of acceptable characters is pretty small, you can use a regular expression like this:
Regex.IsMatch(items, "^[a-z0-9 ]+$", RegexOptions.IgnoreCase);
The regular expression used here looks for any character from a-z and 0-9 including a space (what's inside the square brackets []), and requires one or more of these characters (the + sign; you can use a * for 0 or more). The ^ and $ anchors make the pattern cover the whole string; without them, a single acceptable character anywhere in the input would be enough to match. The final option tells the regex parser to ignore case.
This will fail on anything that is not a letter, number, or space. To add more characters to the blessed list, add them inside the square brackets.
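To see what the ^ and $ anchors contribute to a character-class check like this, the sketch below compares an unanchored and an anchored pattern. The class and method names are invented for the example.

```csharp
using System.Text.RegularExpressions;

public static class AllowedCharsCheck
{
    // Unanchored: succeeds as soon as one allowed character appears anywhere.
    public static bool ContainsAllowedRun(string input)
    {
        return Regex.IsMatch(input, "[a-z0-9 ]+", RegexOptions.IgnoreCase);
    }

    // Anchored: every character between start and end must be an allowed one.
    public static bool IsOnlyAllowed(string input)
    {
        return Regex.IsMatch(input, "^[a-z0-9 ]+$", RegexOptions.IgnoreCase);
    }
}
```

With the input "abc#", ContainsAllowedRun returns true (it finds "abc"), while IsOnlyAllowed returns false. One caveat: in .NET, $ also matches just before a final newline, so a stricter check would use \z at the end of the pattern instead.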
Use the regular expression below to validate that a string contains numbers, letters, or spaces only:
^[a-zA-Z0-9 ]*$
You could do it with a bool. I've been learning recently and found I could do it this way. In this example, I'm checking a user's input to the console:
using System;
using System.Linq;
namespace CheckStringContent
{
class Program
{
static void Main(string[] args)
{
//Get a password to check
Console.WriteLine("Please input a Password: ");
string userPassword = Console.ReadLine();
//Check the string
bool symbolCheck = userPassword.Any(p => !char.IsLetterOrDigit(p));
//Write results to console
Console.WriteLine($"Symbols are present: {symbolCheck}");
}
}
}
This prints 'True' if special characters are present in the string (symbolCheck is true), and 'False' if not.
Another way, as a C# extension method (it must be declared inside a static class):
public static bool HasSpecialCharacter(this string s)
{
foreach (var c in s)
{
if(!char.IsLetterOrDigit(c))
{
return true;
}
}
return false;
}
And access it like this:
myString.HasSpecialCharacter();
private bool isMatch(string strValue,string specialChars)
{
return specialChars.Where(x => strValue.Contains(x)).Any();
}
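Create a method called hasSpecialChar with one parameter and use foreach to check the textbox input against every character in the array. Add as many characters to the array as you want; in my case I only used ) and ( in an attempt to prevent SQL injection. (Blocking characters is not a reliable defence against SQL injection; parameterized queries are.)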
public void hasSpecialChar(string input)
{
char[] specialChar = {'(',')'};
foreach (char item in specialChar)
{
if (input.Contains(item)) MessageBox.Show("it contains");
}
}
Call it from your button's click event handler (double-click the button in the designer to generate one), like this:
private void button1_Click(object sender, EventArgs e)
{
hasSpecialChar(textbox1.Text);
}
While there are many ways to skin this cat, I prefer to wrap such code into reusable extension methods that make it trivial to do going forward. When using extension methods, you can also avoid RegEx as it is slower than a direct character check. I like using the extensions in the Extensions.cs NuGet package. It makes this check as simple as:
Add the https://www.nuget.org/packages/Extensions.cs package to your project.
Add "using Extensions;" to the top of your code.
"smith23#".IsAlphaNumeric() will return False whereas "smith23".IsAlphaNumeric() will return True. By default the .IsAlphaNumeric() method ignores spaces, but it can also be overridden such that "smith 23".IsAlphaNumeric(false) will return False since the space is not considered part of the alphabet.
Every other check in the rest of the code is simply MyString.IsAlphaNumeric().
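Based on @prmph's answer, it can be simplified even further (omitting the lambda variable and passing a method group). Note that the check has to be expressed with All and negated, since Any(char.IsLetterOrDigit) would only report whether at least one letter or digit is present:
!yourString.All(char.IsLetterOrDigit);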
To allow only letters, digits, and hyphens, and to reject the empty string:
^[a-zA-Z0-9-]+$

Round-trip-safe escaping of strings in C#

I am confused by all the different escaping mechanisms for strings in C#. What I want is an escaping/unescaping method that:
1) Can be used on any string
2) escape+unescape is guaranteed to return the initial string
3) Replaces all punctuation with something else. If that is too much to ask, then at least commas, braces, and #. I am fine with spaces not being escaped.
4) Is unlikely to ever change.
Does it exist?
EDIT: This is for purposes of serializing and deserializing app-generated attributes. So my object may or may not have values for Attribute1, Attribute2, Attribute3, etc. Simplifying a bit, the idea is to do something like the below. The goal is to have the encoded collection be brief and more-or-less human-readable.
I am asking what methods would make sense to use for Escape and Unescape.
public abstract class GenericAttribute {
const string key1 = "KEY1"; //It is fine to put some restrictions on the keys, i.e. no punctuation
const string key2 = "KEY2";
public abstract string Encode(); // NO RESTRICTIONS ON WHAT ENCODE MIGHT RETURN
public static GenericAttribute FromKeyValuePair (string key, string value) {
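switch (key) {
case key1: return new ConcreteAttribute1(value);
case key2: return new ConcreteAttribute2(value);
// etc.
default: throw new ArgumentException("Unknown key: " + key, nameof(key)); // every code path must return or throw
}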
}
}
public class AttributeCollection {
Dictionary <string, GenericAttribute> Content {get;set;}
public string Encode() {
string r = "";
bool first = true;
foreach (KeyValuePair<string, GenericAttribute> pair in this.Content) {
if (first) {
first = false;
} else {
r+=",";
}
r+=(pair.Key + "=" + Escape(pair.Value.Encode()));
}
return r;
}
public AttributeCollection(string encodedCollection) {
// input string is the return value of the Encode method
this.Content = new Dictionary<string, GenericAttribute>();
string[] array = encodedCollection.Split(',');
foreach(string component in array) {
int equalsIndex = component.IndexOf('=');
string key = component.Substring(0, equalsIndex);
string value = component.Substring(equalsIndex+1);
GenericAttribute attribute = GenericAttribute.FromKeyValuePair(key, Unescape(value));
this.Content[key]=attribute;
}
}
}
I'm not entirely sure what you're asking, but I believe your intent is for the escape character itself to be included in the string.
var content = @"\'Hello";
Console.WriteLine(content);
// Output:
// \'Hello
By using the @ prefix (a verbatim string literal), backslashes are not treated as escape sequences and become a part of your string. That covers the server side with C#; how to account for other languages and escape formats is something only you would know.
You can find some great information on C# escaping here:
MSDN Blog
Try using HttpServerUtility.UrlEncode and HttpServerUtility.UrlDecode. I think that will encode and decode all the things you want.
See the MSDN Docs and here is a description of the mapping on Wikipedia.
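As an alternative to the HttpServerUtility methods, the same percent-encoding idea can be sketched with Uri.EscapeDataString and Uri.UnescapeDataString, which do not depend on System.Web. The AttributeEscaper class below is a hypothetical stand-in for the Escape and Unescape methods in the question, and the sketch assumes well-formed Unicode input.

```csharp
using System;

public static class AttributeEscaper
{
    // Percent-encodes everything except letters, digits and - . _ ~
    // (on current .NET versions), so ',' '=' '{' '}' and '#' are all replaced.
    public static string Escape(string value)
    {
        return Uri.EscapeDataString(value);
    }

    // Reverses Escape by decoding the %XX sequences.
    public static string Unescape(string value)
    {
        return Uri.UnescapeDataString(value);
    }
}
```

For example, AttributeEscaper.Escape("a,b={c}#d") gives "a%2Cb%3D%7Bc%7D%23d", and Unescape turns it back into the original. This does not meet the "all punctuation" wish completely, because hyphens, dots, underscores and tildes are left as they are, and spaces become %20. Older .NET Framework versions may escape a slightly different character set and may reject very long strings, so it is worth testing on the runtime you target.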

Get List<String> of only one length when I split String with character array

I use this property to help users filter results, by specifying what shouldn't appear. They can separate their terms with all the characters in INCLUDE_INTERPRET_SEPARATORS. The String is saved to an XML file at startup and close.
However, the List always ends up with only one element. I wondered for some time whether it had to do with loading the values through XML deserialization, but breakpoints confirmed that the application uses the setters on startup.
After the update, I've confirmed that the splitting will work in a different environment. I still don't know why this code didn't work originally.
_Exclude and Exclude below, are different types on purpose.
private readonly char[] INCLUDE_INTERPRET_SEPARATORS = {';', '|', '+'};
private const string INCLUDE_SEPARATOR = ";";
private List<string> _Exclude = new List<string>();
[DataMember()]
public string Exclude
{
get
{
return String.Join(INCLUDE_SEPARATOR, _Exclude);
}
set
{
string input = Utils.RemoveDiacritics(value);
_Exclude = new List<string>(input.Split(INCLUDE_INTERPRET_SEPARATORS, StringSplitOptions.RemoveEmptyEntries));
onPropertyChanged("Exclude");
}
}
Example
In my XML file I have (amongst other things)
<Episode>9</Episode>
<Exclude>WEB-DL;1080i;MPEG</Exclude>
<FilterEpisode>true</FilterEpisode>
Breakpoints show that Exclude is set to
Index Value Type
[0] "WEB-DL;1080i;MPEG" String
Am I missing something obvious about this?
Update
I made a test on dotnetfiddle and found that the code works in a simplified environment, without DataContractSerializer.
Similarly, when I add an extra property, it works:
private readonly char[] INCLUDE_INTERPRET_SEPARATORS = {';', '|', '+'};
private const string INCLUDE_SEPARATOR = ";";
[IgnoreDataMember()]
public List<string> ExcludeList
{
get
{
return new List<string>(Exclude.Split(INCLUDE_INTERPRET_SEPARATORS, StringSplitOptions.RemoveEmptyEntries));
}
}
private string _Exclude = "";
[DataMember()]
public string Exclude
{
get
{
return _Exclude;
}
set
{
_Exclude = Utils.RemoveDiacritics(value);
foreach (string x in ExcludeList)
{
System.Diagnostics.Debug.WriteLine(x);
}
onPropertyChanged("Exclude");
}
}
Update 2
I figured out what the problem is. INCLUDE_INTERPRET_SEPARATORS is not initialized when the object is created by XML deserialization (DataContractSerializer does not run instance field initializers), so the string doesn't get split. Making fields like this static fixes it, because static fields are initialized when the type is first used.
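The static-field fix described above can be sketched as a small round trip through DataContractSerializer. FilterSettings and FilterSettingsDemo are hypothetical names, and the class is cut down to the one property that matters here (no RemoveDiacritics, no change notification).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.Serialization;

[DataContract]
public class FilterSettings
{
    // Static, so it is set by the type initializer and is available even
    // when the serializer creates the instance without running the
    // instance field initializers.
    private static readonly char[] Separators = { ';', '|', '+' };

    // Not a data member; rebuilt by the Exclude setter during deserialization.
    private List<string> _exclude = new List<string>();

    public IReadOnlyList<string> ExcludeList
    {
        get { return _exclude; }
    }

    [DataMember]
    public string Exclude
    {
        get { return string.Join(";", _exclude ?? new List<string>()); }
        set
        {
            _exclude = new List<string>(
                (value ?? "").Split(Separators, StringSplitOptions.RemoveEmptyEntries));
        }
    }
}

public static class FilterSettingsDemo
{
    // Serializes the object to memory and reads it back.
    public static FilterSettings RoundTrip(FilterSettings settings)
    {
        var serializer = new DataContractSerializer(typeof(FilterSettings));
        using (var stream = new MemoryStream())
        {
            serializer.WriteObject(stream, settings);
            stream.Position = 0;
            return (FilterSettings)serializer.ReadObject(stream);
        }
    }
}
```

After RoundTrip(new FilterSettings { Exclude = "WEB-DL;1080i;MPEG" }), ExcludeList on the returned object should hold three entries. Another option, if the field has to stay per-instance, is to assign it in a method marked with the OnDeserializing attribute.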
Yes, you are missing something. When you set your breakpoint, check the value of _Exclude, not Exclude.

Extension error assignment, how to fix?

Suppose, for example, that I have a string and I want to escape it in a way that reads well.
I need a working extension method that solves this problem.
Here is what I tried:
var t = "'";
t.Escape(); // returns "%27" (what I need), but does not assign the result to t
t = t.Escape(); // works, but ugly
And the extension:
public static string Escape(this string string_2)
{
if (string_2.HasValue())
string_2 = Uri.EscapeDataString(string_2);
return string_2;
}
How can I fix this extension so that it works?
t = t.Escape(); is the usual idiom in .NET for changing a string. E.g. t = t.Replace("a", "b"); I'd recommend you use this. This is necessary because strings are immutable.
There are ways around it, but they are uglier IMO. For example, you could use a ref parameter (but not on an extension method):
public static string Escape (ref string string_2) { ... }
Util.Escape(ref t);
Or you could make your own String-like class that's mutable:
public class MutableString { /** include implicit conversions to/from string */ }
public static string Escape (this MutableString string_2) { ... }
MutableString t = "'";
t.Escape();
I'd caution you that if you use anything besides t = t.Escape();, and thus deviate from normal usage, you are likely to confuse anyone that reads the code in the future.
"Mutable string" in C# is spelled StringBuilder.
So you could do something like this:
public static void Escape(this StringBuilder text)
{
var s = text.ToString();
text.Clear();
text.Append(Uri.EscapeDataString(s));
}
But using it wouldn't really be that great:
StringBuilder test = new StringBuilder("'");
test.Escape();
Console.WriteLine(test);
The real answer is to use the "ugly" string reassignment
t = t.Escape();//works, but ugly.
You'll get used to it. :)

How to validate that a string doesn't contain HTML using C#

Does anyone have a simple, efficient way of checking that a string doesn't contain HTML? Basically, I want to check that certain fields only contain plain text. I thought about looking for the < character, but that can easily be used in plain text. Another way might be to create a new System.Xml.Linq.XElement using:
XElement.Parse("<wrapper>" + MyString + "</wrapper>")
and check that the XElement contains no child elements, but this seems a little heavyweight for what I need.
The following will match any matching set of tags, e.g. <b>this</b>:
Regex tagRegex = new Regex(@"<\s*([^ >]+)[^>]*>.*?<\s*/\s*\1\s*>");
The following will match any single tag, e.g. <b> (it doesn't have to be closed):
Regex tagRegex = new Regex(@"<[^>]+>");
You can then use it like so
bool hasTags = tagRegex.IsMatch(myString);
You could ensure plain text by encoding the input using HttpUtility.HtmlEncode.
In fact, depending on how strict you want the check to be, you could use it to determine if the string contains HTML:
bool containsHTML = (myString != HttpUtility.HtmlEncode(myString));
Here you go:
using System.Text.RegularExpressions;
private bool ContainsHTML(string checkString)
{
return Regex.IsMatch(checkString, "<(.|\n)*?>");
}
That is the simplest way, since items in brackets are unlikely to occur naturally.
I just tried my XElement.Parse solution. I created an extension method on the string class so I can reuse the code easily:
public static bool ContainsXHTML(this string input)
{
try
{
XElement x = XElement.Parse("<wrapper>" + input + "</wrapper>");
return !(x.DescendantNodes().Count() == 1 && x.DescendantNodes().First().NodeType == XmlNodeType.Text);
}
catch (XmlException ex)
{
return true;
}
}
One problem I found was that plain text ampersand and less than characters cause an XmlException and indicate that the field contains HTML (which is wrong). To fix this, the input string passed in first needs to have the ampersands and less than characters converted to their equivalent XHTML entities. I wrote another extension method to do that:
public static string ConvertXHTMLEntities(this string input)
{
// Convert all ampersands to the ampersand entity.
string output = input;
output = output.Replace("&amp;", "amp_token");
output = output.Replace("&", "&amp;");
output = output.Replace("amp_token", "&amp;");
// Convert less than to the less than entity (without messing up tags).
output = output.Replace("< ", "&lt; ");
return output;
}
Now I can take a user submitted string and check that it doesn't contain HTML using the following code:
bool ContainsHTML = UserEnteredString.ConvertXHTMLEntities().ContainsXHTML();
I'm not sure if this is bullet proof, but I think it's good enough for my situation.
This also checks for self-closing tags like < br /> with optional whitespace. The list does not contain the newer HTML5 tags.
internal static class HtmlExts
{
public static bool containsHtmlTag(this string text, string tag)
{
var pattern = @"<\s*" + tag + @"\s*\/?>";
return Regex.IsMatch(text, pattern, RegexOptions.IgnoreCase);
}
public static bool containsHtmlTags(this string text, string tags)
{
var ba = tags.Split('|').Select(x => new {tag = x, hastag = text.containsHtmlTag(x)}).Where(x => x.hastag);
return ba.Count() > 0;
}
public static bool containsHtmlTags(this string text)
{
return
text.containsHtmlTags(
"a|abbr|acronym|address|area|b|base|bdo|big|blockquote|body|br|button|caption|cite|code|col|colgroup|dd|del|dfn|div|dl|DOCTYPE|dt|em|fieldset|form|h1|h2|h3|h4|h5|h6|head|html|hr|i|img|input|ins|kbd|label|legend|li|link|map|meta|noscript|object|ol|optgroup|option|p|param|pre|q|samp|script|select|small|span|strong|style|sub|sup|table|tbody|td|textarea|tfoot|th|thead|title|tr|tt|ul|var");
}
}
Angle brackets may not be your only challenge. Other character sequences can also be used for script injection, such as the common double hyphen "--", which can also be used in SQL injection. And there are others.
On an ASP.Net page, if validateRequest = true in machine.config, web.config or the page directive, the user will get an error page stating "A potentially dangerous Request.Form value was detected from the client" if an HTML tag or various other potential script-injection attacks are detected. You probably want to avoid this and provide a more elegant, less-scary UI experience.
You could test for both the opening and closing angle brackets using a regular expression, and allow the text if only one of them occurs. Allow < or >, but not < followed by some text and then >, in that order.
You could allow angle brackets and HtmlEncode the text to preserve them when the data is persisted.
Beware when using the HttpUtility.HtmlEncode method mentioned above. If the text you are checking contains special characters but no HTML, the check will incorrectly report HTML. Maybe that's why J c wrote "...depending on how strict you want the check to be..."
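The difference between the tag regex and the encode-and-compare check can be shown with a short sketch. It uses WebUtility.HtmlEncode from System.Net in place of HttpUtility.HtmlEncode (a substitution made here to avoid the System.Web dependency), and the class and method names are invented for the example.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlChecks
{
    // Matches any single tag such as <b> or </b>.
    private static readonly Regex AnyTag = new Regex("<[^>]+>");

    // True only if something shaped like a tag is present.
    public static bool HasTag(string text)
    {
        return AnyTag.IsMatch(text);
    }

    // True if HTML encoding would change the text at all,
    // which also happens for a plain '&' or a lone '<'.
    public static bool ChangesWhenEncoded(string text)
    {
        return text != WebUtility.HtmlEncode(text);
    }
}
```

For the plain-text input "Tom & Jerry", HasTag returns false but ChangesWhenEncoded returns true, which is the false positive the warning above refers to. For "<b>bold</b>" both return true.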
