How to validate that a string doesn't contain HTML using C#

Does anyone have a simple, efficient way of checking that a string doesn't contain HTML? Basically, I want to check that certain fields only contain plain text. I thought about looking for the < character, but that can easily be used in plain text. Another way might be to create a new System.Xml.Linq.XElement using:
XElement.Parse("<wrapper>" + MyString + "</wrapper>")
and check that the XElement contains no child elements, but this seems a little heavyweight for what I need.

The following will match any matching pair of tags, e.g. <b>this</b>:
Regex tagRegex = new Regex(@"<\s*([^ >]+)[^>]*>.*?<\s*/\s*\1\s*>");
The following will match any single tag, e.g. <b> (it doesn't have to be closed):
Regex tagRegex = new Regex(@"<[^>]+>");
You can then use it like so
bool hasTags = tagRegex.IsMatch(myString);
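To see how the single-tag pattern behaves on plain text that merely contains angle brackets, here is a small sketch. The class name TagCheck and the sample inputs are made up for illustration; only the pattern itself comes from the answer above.

```csharp
using System.Text.RegularExpressions;

public static class TagCheck
{
    // Single-tag pattern: '<', one or more characters that are not '>', then '>'.
    private static readonly Regex TagRegex = new Regex(@"<[^>]+>");

    public static bool HasTags(string input)
    {
        return TagRegex.IsMatch(input);
    }
}
```

TagCheck.HasTags("1 < 2") is false, but TagCheck.HasTags("a < b and c > d") is true, so plain text with both brackets in that order is still reported as containing a tag.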

You could ensure plain text by encoding the input using HttpUtility.HtmlEncode.
In fact, depending on how strict you want the check to be, you could use it to determine if the string contains HTML:
bool containsHTML = (myString != HttpUtility.HtmlEncode(myString));

Here you go:
using System.Text.RegularExpressions;
private bool ContainsHTML(string checkString)
{
return Regex.IsMatch(checkString, "<(.|\n)*?>");
}
That is the simplest way, since items in brackets are unlikely to occur naturally.

I just tried my XElement.Parse solution. I created an extension method on the string class so I can reuse the code easily:
public static bool ContainsXHTML(this string input)
{
try
{
XElement x = XElement.Parse("<wrapper>" + input + "</wrapper>");
return !(x.DescendantNodes().Count() == 1 && x.DescendantNodes().First().NodeType == XmlNodeType.Text);
}
catch (XmlException ex)
{
return true;
}
}
One problem I found was that plain text ampersand and less than characters cause an XmlException and indicate that the field contains HTML (which is wrong). To fix this, the input string passed in first needs to have the ampersands and less than characters converted to their equivalent XHTML entities. I wrote another extension method to do that:
public static string ConvertXHTMLEntities(this string input)
{
// Convert all ampersands to the ampersand entity.
string output = input;
output = output.Replace("&amp;", "amp_token");
output = output.Replace("&", "&amp;");
output = output.Replace("amp_token", "&amp;");
// Convert less than to the less than entity (without messing up tags).
output = output.Replace("< ", "&lt; ");
return output;
}
Now I can take a user submitted string and check that it doesn't contain HTML using the following code:
bool ContainsHTML = UserEnteredString.ConvertXHTMLEntities().ContainsXHTML();
I'm not sure if this is bullet proof, but I think it's good enough for my situation.

This also checks for self-closing tags such as < br />, with optional whitespace. The list does not contain the newer HTML5 tags.
internal static class HtmlExts
{
public static bool containsHtmlTag(this string text, string tag)
{
var pattern = @"<\s*" + tag + @"\s*\/?>";
return Regex.IsMatch(text, pattern, RegexOptions.IgnoreCase);
}
public static bool containsHtmlTags(this string text, string tags)
{
var ba = tags.Split('|').Select(x => new {tag = x, hastag = text.containsHtmlTag(x)}).Where(x => x.hastag);
return ba.Count() > 0;
}
public static bool containsHtmlTags(this string text)
{
return
text.containsHtmlTags(
"a|abbr|acronym|address|area|b|base|bdo|big|blockquote|body|br|button|caption|cite|code|col|colgroup|dd|del|dfn|div|dl|DOCTYPE|dt|em|fieldset|form|h1|h2|h3|h4|h5|h6|head|html|hr|i|img|input|ins|kbd|label|legend|li|link|map|meta|noscript|object|ol|optgroup|option|p|param|pre|q|samp|script|select|small|span|strong|style|sub|sup|table|tbody|td|textarea|tfoot|th|thead|title|tr|tt|ul|var");
}
}

Angle brackets may not be your only challenge. Other character sequences can also be used for injection, such as the double hyphen "--", which is common in SQL injection. And there are others.
On an ASP.Net page, if validateRequest = true in machine.config, web.config or the page directive, the user will get an error page stating "A potentially dangerous Request.Form value was detected from the client" if an HTML tag or various other potential script-injection attacks are detected. You probably want to avoid this and provide a more elegant, less-scary UI experience.
You could test for both the opening and closing brackets <> using a regular expression, and allow the text if only one of them occurs. Allow < or >, but not < followed by some text and then >, in that order.
You could allow angle brackets and HtmlEncode the text to preserve them when the data is persisted.

Beware when using the HttpUtility.HtmlEncode method mentioned above. If you are checking some text with special characters, but not HTML, it will evaluate incorrectly. Maybe that's why J c used "...depending on how strict you want the check to be..."
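A minimal sketch of that pitfall follows. It uses System.Net.WebUtility.HtmlEncode as a stand-in for HttpUtility.HtmlEncode so that no System.Web reference is needed; that substitution, the class name EncodeCheck and the sample strings are assumptions made for this example.

```csharp
using System.Net;

public static class EncodeCheck
{
    // Treats any string that HtmlEncode would change as "containing HTML".
    // WebUtility.HtmlEncode stands in for HttpUtility.HtmlEncode here (assumption).
    public static bool ChangesWhenEncoded(string input)
    {
        return input != WebUtility.HtmlEncode(input);
    }
}
```

EncodeCheck.ChangesWhenEncoded("Tom & Jerry") returns true even though the text contains no markup, because the ampersand is encoded. The check really answers "does this string contain characters that need encoding", which is stricter than "does it contain HTML".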

Related

Regex for string without special characters or spaces

How do I check a string to make sure it contains numbers, letters, or space only?
In C# this is simple:
private bool HasSpecialChars(string yourString)
{
return yourString.Any(ch => ! char.IsLetterOrDigit(ch));
}
The easiest way is to use a regular expression:
Regular Expression for alphanumeric and underscores
Using regular expressions in .net:
http://www.regular-expressions.info/dotnet.html
MSDN Regular Expression
Regex.IsMatch
var regexItem = new Regex("^[a-zA-Z0-9 ]*$");
if(regexItem.IsMatch(YOUR_STRING)){..}
string s = @"$KUH% I*$)OFNlkfn$";
var withoutSpecial = new string(s.Where(c => Char.IsLetterOrDigit(c)
|| Char.IsWhiteSpace(c)).ToArray());
if (s != withoutSpecial)
{
Console.WriteLine("String contains special chars");
}
Try this way.
public static bool hasSpecialChar(string input)
{
string specialChar = @"\|!#$%&/()=?»«@£§€{}.-;'<>_,";
foreach (var item in specialChar)
{
if (input.Contains(item)) return true;
}
return false;
}
String test_string = "tesintg#$234524##";
if (System.Text.RegularExpressions.Regex.IsMatch(test_string, "^[a-zA-Z0-9\x20]+$"))
{
// Good-to-go
}
An example can be found here: http://ideone.com/B1HxA
If the list of acceptable characters is pretty small, you can use a regular expression like this:
Regex.IsMatch(items, "^[a-z0-9 ]+$", RegexOptions.IgnoreCase);
The regular expression used here looks for any character from a-z and 0-9 including a space (what's inside the square brackets []), and requires one or more of these characters (the + sign; you can use a * for 0 or more). The ^ and $ anchors make the pattern cover the whole string; without them, a single allowed character anywhere in the input would be enough to match. The final option tells the regex parser to ignore case.
This will fail on anything that is not a letter, number, or space. To add more characters to the blessed list, add them inside the square brackets.
Use the regular expression below to validate a string to make sure it contains numbers, letters, or spaces only:
^[a-zA-Z0-9 ]*$
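The difference anchoring makes to a character-class pattern like this can be checked with a short sketch. The class and method names below are hypothetical and exist only for this illustration.

```csharp
using System.Text.RegularExpressions;

public static class AllowedChars
{
    // Anchored: every character must be a letter, digit, or space.
    // Note: in .NET, $ also matches just before a trailing newline; use \z to rule that out.
    public static bool IsOnlyAllowed(string input)
    {
        return Regex.IsMatch(input, "^[a-zA-Z0-9 ]*$");
    }

    // Unanchored: true as soon as one allowed character appears anywhere in the input.
    public static bool ContainsAllowed(string input)
    {
        return Regex.IsMatch(input, "[a-zA-Z0-9 ]");
    }
}
```

For the input "abc#123", IsOnlyAllowed returns false while ContainsAllowed returns true, which is why an unanchored class on its own cannot be used for validation.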
You could do it with a bool. I've been learning recently and found I could do it this way. In this example, I'm checking a user's input to the console:
using System;
using System.Linq;
namespace CheckStringContent
{
class Program
{
static void Main(string[] args)
{
//Get a password to check
Console.WriteLine("Please input a Password: ");
string userPassword = Console.ReadLine();
//Check the string
bool symbolCheck = userPassword.Any(p => !char.IsLetterOrDigit(p));
//Write results to console
Console.WriteLine($"Symbols are present: {symbolCheck}");
}
}
}
This returns 'True' if special chars (symbolCheck) are present in the string, and 'False' if not present.
A simple way using a C# extension method:
public static bool HasSpecialCharacter(this string s)
{
foreach (var c in s)
{
if(!char.IsLetterOrDigit(c))
{
return true;
}
}
return false;
}
And access it like this:
myString.HasSpecialCharacter();
private bool isMatch(string strValue,string specialChars)
{
return specialChars.Where(x => strValue.Contains(x)).Any();
}
Create a method called hasSpecialChar with one parameter and use foreach to check every single character in the textbox. Add as many characters as you want to the array; in my case I just used ) and ( to prevent SQL injection.
public void hasSpecialChar(string input)
{
char[] specialChar = {'(',')'};
foreach (char item in specialChar)
{
if (input.Contains(item)) MessageBox.Show("it contains");
}
}
Call it from your button's click event handler (double-click the button in the designer to generate one), like this:
private void button1_Click(object sender, EventArgs e)
{
hasSpecialChar(textbox1.Text);
}
While there are many ways to skin this cat, I prefer to wrap such code into reusable extension methods that make it trivial to do going forward. When using extension methods, you can also avoid RegEx as it is slower than a direct character check. I like using the extensions in the Extensions.cs NuGet package. It makes this check as simple as:
Add the https://www.nuget.org/packages/Extensions.cs package to your project.
Add "using Extensions;" to the top of your code.
"smith23#".IsAlphaNumeric() will return False whereas "smith23".IsAlphaNumeric() will return True. By default the .IsAlphaNumeric() method ignores spaces, but it can also be overridden such that "smith 23".IsAlphaNumeric(false) will return False since the space is not considered part of the alphabet.
Every other check in the rest of the code is simply MyString.IsAlphaNumeric().
Based on @prmph's answer, it can be simplified even further (omitting the lambda variable and passing the method group, which overload resolution handles):
!yourString.All(char.IsLetterOrDigit);
No special characters or empty string except hyphen
^[a-zA-Z0-9-]+$

C# Regex, any more efficient way to parse string enclosed by symbol?

I'm not sure if it's okay to ask... But here goes.
I implemented a method that parses a string using regex, each matching are parsed through the delegates with an order ( actually, order is not important-- I think, wait, is it? ... But I wrote it this way, and it's not fully tested ):
Pattern Regex.Replace: @"(?<!\\)\$.+?\$" then String.Replace: @"\$", @"$"; Replace string enclosed by dollar sign. Ignores backslash ones, then erases backslash. Ex: "$global name$" -> "motherofglobalvar", "Money \$9000" -> "Money $9000"
Pattern Regex.Replace @"(?<!\\)%.+?%" then String.Replace @"\%", @"%"; Replace string enclosed by percentage sign. Ignores backslash ones, then erases backslash. Same as previous example: "%local var%" -> "lordoflocalvar", "It's over 9000\%" -> "It's over 9000%"
Pattern Regex.Replace @"(?<!\\)#" then String.Replace @"\#", @"#"; Replace char '#' with whitespace, ' '. But ignore backslash ones, then erase the backslash. Ex: "I#hit#the#ground#too#hard" -> "I hit the ground too hard", "qw\#op" -> "qw#op"
What I've done without much experience (I think):
//parse variable
public static string ParseVariable(string text)
{
return Regex.Replace(Regex.Replace(Regex.Replace(text, @"(?<!\\)\$.+?\$", match =>
{
string trim = match.Value.Trim('$');
string trimUpper = trim.ToUpper();
return variableGlobal.ContainsKey(trim) ? variableGlobal[trim] : match.Value;
}).Replace(@"\$", @"$"), @"(?<!\\)%.+?%", match =>
{
string trim = match.Value.Trim('%');
string trimUpper = trim.ToUpper();
return variableLocal.ContainsKey(trim) ? variableLocal[trim] : match.Value;
}).Replace(@"\%", @"%"), @"(?<!\\)#", " ").Replace(@"\#", @"#");
}
In short, what I used is: Regex.Replace().Replace()
Since I need to parse 3 kinds of symbols, I chained it as following: Regex.Replace(Regex.Replace(Regex.Replace().Replace()).Replace()).Replace()
Is there any more efficient way than this? I mean, like without need to go through the text 6 times? (3 times regex.replace, 3 times string.replace, where each replace modifies the text to be used by the next replace )
Or is it the best way it can do?
Thanks.
Here's a unique take on the problem, I think. You can build a class that will be used to construct the overall pattern piece-by-piece. This class will be responsible for the generating of the MatchEvaluator delegate that will be passed to Replace as well.
class RegexReplacer
{
public string Pattern { get; private set; }
public string Replacement { get; private set; }
public string GroupName { get; private set; }
public RegexReplacer NextReplacer { get; private set; }
public RegexReplacer(string pattern, string replacement, string groupName, RegexReplacer nextReplacer = null)
{
this.Pattern = pattern;
this.Replacement = replacement;
this.GroupName = groupName;
this.NextReplacer = nextReplacer;
}
public string GetAggregatedPattern()
{
string constructedPattern = this.Pattern;
string alternation = (this.NextReplacer == null ? string.Empty : "|" + this.NextReplacer.GetAggregatedPattern()); // If there isn't another replacer, then we won't have an alternation; otherwise, we build an alternation between this pattern and the next replacer's "full" pattern
constructedPattern = string.Format("(?<{0}>{1}){2}", this.GroupName, this.Pattern, alternation); // The (?<XXX>) syntax builds a named capture group. This is used by our GetReplaceDelegate method.
return constructedPattern;
}
public MatchEvaluator GetReplaceDelegate()
{
return (match) =>
{
if (match.Groups[this.GroupName] != null && match.Groups[this.GroupName].Length > 0) // Did we get a hit on the group name?
{
return this.Replacement;
}
else if (this.NextReplacer != null) // No? Then is there another replacer to inspect?
{
MatchEvaluator next = this.NextReplacer.GetReplaceDelegate();
return next(match);
}
else
{
return match.Value; // No? Then simply return the value
}
};
}
}
It should be obvious what Pattern and Replacement represent. GroupName is kind of a hack to let the replacement evaluator know which RegexReplacer fragment resulted in the match. NextReplacer points to another replacer instance that holds a different pattern fragment (and so on).
The idea here is to have a kind of linked list of objects that will represent the overall pattern. You can call GetAggregatedPattern on the outer-most replacer to get the full pattern: each replacer calls the next replacer's GetAggregatedPattern to get that replacer's pattern fragment, to which it concatenates its own fragment. GetReplaceDelegate generates a MatchEvaluator. This MatchEvaluator will compare its own GroupName to the Match's captured groups. If the group name was captured, then we have a hit, and we return this replacer's Replacement value. Otherwise, we step into the next replacer (if there is one) and repeat the group name comparison. If there is no hit on any replacer, then we simply yield back the original value (i.e. what was matched by the pattern; this should be rare).
The usage of such might look like this:
string target = @"$global name$ Money \$9000 %local var% It's over 9000\% I#hit#the#ground#too#hard qw\#op";
RegexReplacer dollarWrapped = new RegexReplacer(@"(?<!\\)\$[^$]+\$", "motherofglobalvar", "dollarWrapped");
RegexReplacer slashDollar = new RegexReplacer(@"\\\$", "$", "slashDollar", dollarWrapped); // keep the symbol, drop only the backslash
RegexReplacer percentWrapped = new RegexReplacer(@"(?<!\\)%[^%]+%", "lordoflocalvar", "percentWrapped", slashDollar);
RegexReplacer slashPercent = new RegexReplacer(@"\\%", "%", "slashPercent", percentWrapped); // keep the symbol, drop only the backslash
RegexReplacer singleAt = new RegexReplacer(@"(?<!\\)#", " ", "singleAt", slashPercent);
RegexReplacer slashAt = new RegexReplacer(@"\\#", "#", "slashAt", singleAt);
RegexReplacer replacer = slashAt;
string pattern = replacer.GetAggregatedPattern();
MatchEvaluator evaluator = replacer.GetReplaceDelegate();
string result = Regex.Replace(target, pattern, evaluator);
Because you want each replacer to know if it got a hit, and because we are hacking this by using group names, you want to make sure that each group name is distinct. A simple way to ensure this would be to use a name that's identical to the variable name since you can't have two variables with the same name within the same scope.
You can see above that I am building each part of the pattern separately, but as I build, I pass the previous replacer as a 4th parameter to the current replacer. This builds the chain of replacers. Once built, I use the last replacer constructed in order to generate the overall pattern and evaluator. If you use anything but, then you will only have part of the overall pattern. Finally, it's simply a matter of passing the generated pattern and evaluator to the Replace method.
Keep in mind that this approach was targeted more at the problem as described. It may work in more general scenarios, but I've only worked with what you've presented. Also, since this is more of a parsing question, a parser may be the proper route to take--although the learning curve is going to be higher.
Also keep in mind that I haven't profiled this code. It certainly doesn't loop over the target string multiple times, but it does involve additional method calls during replacement. You would certainly want to test it in your environment.
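For comparison, the same single-pass idea can be sketched with one fixed alternation and named groups instead of a chain of replacer objects. This is an assumption-laden illustration, not the approach from the answer above: the class SinglePassParser, its group names, and the two dictionaries (standing in for the question's variableGlobal and variableLocal) are all hypothetical, and it consumes escaped symbols as their own alternative instead of using lookbehind, so edge cases such as a doubled backslash may behave differently.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SinglePassParser
{
    // Alternatives are tried left to right at each position:
    //   1. a backslash-escaped symbol -> group "esc"
    //   2. $name$                     -> group "global"
    //   3. %name%                     -> group "local"
    //   4. a bare '#'
    private static readonly Regex Token = new Regex(
        @"\\(?<esc>[$%#])|\$(?<global>[^$]+)\$|%(?<local>[^%]+)%|#");

    public static string Parse(string text,
        IDictionary<string, string> globals,
        IDictionary<string, string> locals)
    {
        return Token.Replace(text, match =>
        {
            if (match.Groups["esc"].Success)
                return match.Groups["esc"].Value;   // drop only the backslash
            if (match.Groups["global"].Success)
                return Lookup(globals, match.Groups["global"].Value, match.Value);
            if (match.Groups["local"].Success)
                return Lookup(locals, match.Groups["local"].Value, match.Value);
            return " ";                             // bare '#'
        });
    }

    // Returns the stored value, or the original matched text if the name is unknown.
    private static string Lookup(IDictionary<string, string> table, string key, string fallback)
    {
        string value;
        return table.TryGetValue(key, out value) ? value : fallback;
    }
}
```

Because every alternative is handled in one Regex.Replace call, the input is scanned once and no follow-up String.Replace passes are needed. Like the code above, this has not been profiled.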

Best way to provide the user an escape string

Suppose I want to ask a user what format they want a certain output to be in and the output will include fill-in fields. So they provide something like this string:
"Output text including some field {FieldName1Value} and another {FieldName2Value} and so on..."
Anything bound by the {} should be a column name in a table somewhere; they will be replaced with the stored value by the code I am writing. Seems simple, I could just do a string.Replace on any instance that matches the pattern "{" + FieldName + "}". But, what if I also want to give the user the option of using an escape so they can use brackets like any other string. I was thinking they provide "{{" or "}}" to escape that bracket - nice and easy for them. So, they could provide something like:
"Output text including some field {FieldName1Value} and another {FieldName2Value} but not this {{FieldName2Value}}"
But now that "{{FieldName2Value}}" is to be treated like any other string and ignored by the Replace. Also, if they decided to put something like "{{{FieldName2Value}}}" with the triple brackets, that would be interpreted by the code as the field value wrapped with brackets and so on.
This is where I get stuck. I am trying with RegEx and came up with this:
public object Convert(object[] values, Type targetType, object parameter, CultureInfo culture)
{
string format = (string)values[0];
ObservableCollection<CalloutFieldAliasMap> oc = (ObservableCollection<CalloutFieldAliasMap>)values[1];
foreach (CalloutFieldMap map in oc)
format = Regex.Replace(format, @"(?<!{){" + map.FieldName + "(?<!})}", " " + map.FieldAlias + " ", RegexOptions.IgnoreCase);
return format;
}
This works in the situation with double brackets {{ }} but NOT if there are three, ie {{{ }}}. The triple brackets are treated like string when it should be treated as {FieldValue}.
Thanks for any help.
By expanding on your regular expression, the presence of literals can be accommodated.
format = Regex.Replace(format,
@"(?<!([^{]|^){(?:{{)*){" + Regex.Escape(map.FieldName) + "}",
String.Format(" {0} ", map.FieldAlias),
RegexOptions.IgnoreCase | RegexOptions.Compiled);
The first part of the expression, (?<!([^{]|^){(?:{{)*){, designates that the { must be preceded by an even number of { characters for it to mark the beginning of a field token. Thus, {FieldName} and {{{FieldName} will denote the start of a field name, whereas {{FieldName} and {{{{FieldName} would not.
The closing } simply requires that the end of the field be a simple }. There is some ambiguity in the syntax in that {FieldName1Value}}} could be parsed as a token with FieldName1Value (followed by the literal }) or FieldName1Value}. The regex assumes the former. (If the latter is intended, you could replace this with }(?!}(}})*) instead.)
A couple of other notes. I added Regex.Escape(map.FieldName) so that all characters in the field name are treated as literals; and added the RegexOptions.Compiled flag. (Since this is both a complex expression and executed in a loop, it is a good candidate for compilation.)
After the loop executes, a simple:
format = format.Replace("{{", "{").Replace("}}", "}")
can be used to unescape the literal {{ and }} characters.
The simplest way would be to use String.Replace to replace the double brackets with a character sequence that the user can not (or almost certainly will not) enter. Then do the replacement of your fields, and finally convert replacement back to the double brackets.
For example, given:
string replaceOpen = "{x"; // 'x' should be something like \u00ff, for example
string replaceClose = "x}";
string template = "Replace {ThisField} but not {{ThatField}}";
string temp = template.Replace("{{", replaceOpen).Replace("}}", replaceClose);
string converted = temp.Replace("{ThisField}", "Foo");
string final = converted.Replace(replaceOpen, "{{").Replace(replaceClose, "}}");
It's not particularly pretty, but it's effective.
How you go about it is going to depend in large part on how often you call this, and how fast you really need it to be.
I have an extension method I wrote that almost does what you ask, but, while it does escape using double braces, it doesn't do the triple braces like you suggested. Here is the method (also on GitHub at https://github.com/benallred/Icing/blob/master/Icing/Icing.Core/StringExtensions.cs):
private const string FormatTokenGroupName = "token";
private static readonly Regex FormatRegex = new Regex(@"(?<!\{)\{(?<" + FormatTokenGroupName + @">\w+)\}(?!\})", RegexOptions.Compiled);
public static string Format(this string source, IDictionary<string, string> replacements)
{
if (string.IsNullOrWhiteSpace(source) || replacements == null)
{
return source;
}
string replaced = replacements.Aggregate(source,
(current, pair) =>
FormatRegex.Replace(current,
new MatchEvaluator(match =>
(match.Groups[FormatTokenGroupName].Value == pair.Key
? pair.Value : match.Value))));
return replaced.Replace("{{", "{").Replace("}}", "}");
}
Usage:
"This is my {FieldName}".Format(new Dictionary<string, string>() { { "FieldName", "value" } });
Even easier if you add this:
public static string Format(this string source, object replacements)
{
if (string.IsNullOrWhiteSpace(source) || replacements == null)
{
return source;
}
IDictionary<string, string> replacementsDictionary = new Dictionary<string, string>();
foreach (PropertyDescriptor propertyDescriptor in TypeDescriptor.GetProperties(replacements))
{
string token = propertyDescriptor.Name;
object value = propertyDescriptor.GetValue(replacements);
replacementsDictionary.Add(token, (value != null ? value.ToString() : String.Empty));
}
return Format(source, replacementsDictionary);
}
Usage:
"This is my {FieldName}".Format(new { FieldName = "value" });
Unit tests for this method are at https://github.com/benallred/Icing/blob/master/Icing/Icing.Tests/Core/TestOf_StringExtensions.cs
If this doesn't work, what would your ideal solution do for more than three braces? In other words, if {{{FieldName}}} becomes {value}, what does {{{{FieldName}}}} become? What about {{{{{FieldName}}}}} and so on? While those cases are unlikely, they still need to be handled purposefully.
RegEx will not do what you want because it only knows its current state and what transitions are available. It has no concept of memory. The language you're trying to parse is not regular so you will never be able to write a RegEx to handle the general case. You would need i expressions where i is the number of matching braces.
There is a lot of theory behind this and I'll provide some links at the bottom if you're curious. But basically the language you're trying to parse is context-free and to implement a general solution you'll need to model a pushdown automaton, which uses a stack to ensure that an opening brace has a matching closing brace (yes, this is why most languages have matching braces).
Each time you encounter { you put it on the stack. If you encounter } you pop from the stack. When you empty the stack you will know that you've reached the end of a field. Of course that's a major simplification of the problem, but if you're looking for a general solution it should get you moving in the right direction.
http://en.wikipedia.org/wiki/Regular_language
http://en.wikipedia.org/wiki/Context-free_language
http://en.wikipedia.org/wiki/Pushdown_automaton
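For the escaping scheme described in the question (doubled braces are literals, single braces delimit a field), a hand-written left-to-right scanner is one possible starting point. The sketch below is not a full pushdown automaton, since fields here do not nest; the class name TemplateScanner and the dictionary of field values are assumptions made for the example.

```csharp
using System.Collections.Generic;
using System.Text;

public static class TemplateScanner
{
    public static string Expand(string template, IDictionary<string, string> fields)
    {
        var output = new StringBuilder();
        int i = 0;
        while (i < template.Length)
        {
            char c = template[i];
            bool doubled = i + 1 < template.Length && template[i + 1] == c;

            if ((c == '{' || c == '}') && doubled)
            {
                output.Append(c);       // "{{" or "}}" is an escaped literal brace
                i += 2;
            }
            else if (c == '{')
            {
                int close = template.IndexOf('}', i + 1);
                string value;
                if (close > i && fields.TryGetValue(template.Substring(i + 1, close - i - 1), out value))
                {
                    output.Append(value);   // known field: substitute its value
                    i = close + 1;
                }
                else
                {
                    output.Append(c);       // unknown field or no closing brace: keep as text
                    i++;
                }
            }
            else
            {
                output.Append(c);           // ordinary character (or a lone '}')
                i++;
            }
        }
        return output.ToString();
    }
}
```

With a field named FieldName mapped to "value", the scanner turns "{FieldName}" into "value", "{{FieldName}}" into "{FieldName}", and "{{{FieldName}}}" into "{value}". Field names are matched case-sensitively unless the caller passes a dictionary built with a case-insensitive comparer.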

Is there a way of making strings file-path safe in c#?

My program will take arbitrary strings from the internet and use them for file names. Is there a simple way to remove the bad characters from these strings or do I need to write a custom function for this?
Ugh, I hate it when people try to guess at which characters are valid. Besides being completely non-portable (always thinking about Mono), both of the earlier comments missed more than 25 invalid characters.
foreach (var c in Path.GetInvalidFileNameChars())
{
fileName = fileName.Replace(c, '-');
}
Or in VB:
'Clean just a filename
Dim filename As String = "salmnas dlajhdla kjha;dmas'lkasn"
For Each c In IO.Path.GetInvalidFileNameChars
filename = filename.Replace(c, "")
Next
'See also IO.Path.GetInvalidPathChars
To strip invalid characters:
static readonly char[] invalidFileNameChars = Path.GetInvalidFileNameChars();
// Builds a string out of valid chars
var validFilename = new string(filename.Where(ch => !invalidFileNameChars.Contains(ch)).ToArray());
To replace invalid characters:
static readonly char[] invalidFileNameChars = Path.GetInvalidFileNameChars();
// Builds a string out of valid chars and an _ for invalid ones
var validFilename = new string(filename.Select(ch => invalidFileNameChars.Contains(ch) ? '_' : ch).ToArray());
To replace invalid characters (and avoid potential name conflict like Hell* vs Hell$):
static readonly IList<char> invalidFileNameChars = Path.GetInvalidFileNameChars();
// Builds a string out of valid chars and replaces invalid chars with a unique letter (Moves the Char into the letter range of unicode, starting at "A")
var validFilename = new string(filename.Select(ch => invalidFileNameChars.Contains(ch) ? Convert.ToChar(invalidFileNameChars.IndexOf(ch) + 65) : ch).ToArray());
This question has been asked many times before and, as pointed out many times before, IO.Path.GetInvalidFileNameChars is not adequate.
First, there are many names like PRN and CON that are reserved and not allowed for filenames. There are other names not allowed only at the root folder. Names that end in a period are also not allowed.
Second, there are a variety of length limitations. Read the full list for NTFS here.
Third, you can attach to filesystems that have other limitations. For example, ISO 9660 filenames cannot start with "-" but can contain it.
Fourth, what do you do if two processes "arbitrarily" pick the same name?
In general, using externally-generated names for file names is a bad idea. I suggest generating your own private file names and storing human-readable names internally.
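One way to follow that suggestion is sketched below: generate the on-disk name yourself and keep the human-readable name in a lookup. The class FileNameRegistry and its in-memory dictionary are hypothetical; a real application would presumably persist the mapping somewhere durable.

```csharp
using System;
using System.Collections.Generic;

public class FileNameRegistry
{
    // Maps each generated file name to the original, human-readable name.
    private readonly Dictionary<string, string> displayNames = new Dictionary<string, string>();

    public string Register(string displayName, string extension)
    {
        // The "N" format yields 32 hexadecimal digits with no punctuation,
        // so the result contains no characters that are invalid in file names.
        string safeName = Guid.NewGuid().ToString("N") + extension;
        displayNames[safeName] = displayName;
        return safeName;
    }

    public string GetDisplayName(string safeName)
    {
        return displayNames[safeName];
    }
}
```

Because the generated name never depends on the external string, reserved names, trailing periods and length limits stop being a concern, and two processes are very unlikely to pick the same name.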
I agree with Grauenwolf and would highly recommend the Path.GetInvalidFileNameChars()
Here's my C# contribution:
string file = @"38?/.\}[+=n a882 a.a*/|n^%$ ad#(-))";
Array.ForEach(Path.GetInvalidFileNameChars(),
c => file = file.Replace(c.ToString(), String.Empty));
p.s. -- this is more cryptic than it should be -- I was trying to be concise.
Here's my version:
static string GetSafeFileName(string name, char replace = '_') {
char[] invalids = Path.GetInvalidFileNameChars();
return new string(name.Select(c => invalids.Contains(c) ? replace : c).ToArray());
}
I'm not sure how the result of GetInvalidFileNameChars is calculated, but the "Get" suggests it's non-trivial, so I cache the results. Further, this only traverses the input string once instead of multiple times, like the solutions above that iterate over the set of invalid chars, replacing them in the source string one at a time. Also, I like the Where-based solutions, but I prefer to replace invalid chars instead of removing them. Finally, my replacement is exactly one character to avoid converting characters to strings as I iterate over the string.
I say all that w/o doing the profiling -- this one just "felt" nice to me. : )
Here's the function that I am using now (thanks jcollum for the C# example):
public static string MakeSafeFilename(string filename, char replaceChar)
{
foreach (char c in System.IO.Path.GetInvalidFileNameChars())
{
filename = filename.Replace(c, replaceChar);
}
return filename;
}
I just put this in a "Helpers" class for convenience.
If you want to quickly strip out all special characters which is sometimes more user readable for file names this works nicely:
string myCrazyName = "q`w^e!r#t#y$u%i^o&p*a(s)d_f-g+h=j{k}l|z:x\"c<v>b?n[m]q\\w;e'r,t.y/u";
string safeName = Regex.Replace(
myCrazyName,
@"\W", /*Matches any nonword character. Equivalent to '[^A-Za-z0-9_]'*/
"",
RegexOptions.IgnoreCase);
// safeName == "qwertyuiopasd_fghjklzxcvbnmqwertyu"
Here's what I just added to ClipFlair's (http://github.com/Zoomicon/ClipFlair) StringExtensions static class (Utils.Silverlight project), based on info gathered from the links to related stackoverflow questions posted by Dour High Arch above:
public static string ReplaceInvalidFileNameChars(this string s, string replacement = "")
{
return Regex.Replace(s,
"[" + Regex.Escape(new String(System.IO.Path.GetInvalidFileNameChars())) + "]",
replacement, //can even use a replacement string of any length
RegexOptions.IgnoreCase);
//not using System.IO.Path.InvalidPathChars (deprecated insecure API)
}
static class Utils
{
public static string MakeFileSystemSafe(this string s)
{
return new string(s.Where(IsFileSystemSafe).ToArray());
}
public static bool IsFileSystemSafe(char c)
{
return !Path.GetInvalidFileNameChars().Contains(c);
}
}
Why not convert the string to a Base64 equivalent like this:
string UnsafeFileName = "salmnas dlajhdla kjha;dmas'lkasn";
string SafeFileName = Convert.ToBase64String(Encoding.UTF8.GetBytes(UnsafeFileName));
If you want to convert it back so you can read it:
UnsafeFileName = Encoding.UTF8.GetString(Convert.FromBase64String(SafeFileName));
I used this to save PNG files with a unique name from a random description.
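One caveat with plain Base64 is that its alphabet includes '/' (and '+'), and '/' is not allowed in file names. A hedged variant that swaps those two characters is sketched below; the class name Base64FileName is made up for the example.

```csharp
using System;
using System.Text;

public static class Base64FileName
{
    public static string Encode(string text)
    {
        string base64 = Convert.ToBase64String(Encoding.UTF8.GetBytes(text));
        // Standard Base64 can contain '/' and '+'; swap them for '_' and '-'.
        return base64.Replace('/', '_').Replace('+', '-');
    }

    public static string Decode(string fileName)
    {
        // Undo the substitution before handing the text to the Base64 decoder.
        string base64 = fileName.Replace('_', '/').Replace('-', '+');
        return Encoding.UTF8.GetString(Convert.FromBase64String(base64));
    }
}
```

For example, Base64FileName.Encode("???") gives "Pz8_" where plain Base64 would give "Pz8/". Two limits remain: the encoded name is about a third longer than the input, and Base64 is case-sensitive, so distinct names can collide on a case-insensitive file system.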
private void textBoxFileName_KeyPress(object sender, KeyPressEventArgs e)
{
e.Handled = CheckFileNameSafeCharacters(e);
}
/// <summary>
/// This is a good function for making sure that a user who is naming a file uses proper characters
/// </summary>
/// <param name="e"></param>
/// <returns></returns>
internal static bool CheckFileNameSafeCharacters(System.Windows.Forms.KeyPressEventArgs e)
{
if (e.KeyChar == (char)24 ||
e.KeyChar == (char)3 ||
e.KeyChar == (char)22 ||
e.KeyChar == (char)26 ||
e.KeyChar == (char)25)//Control-X, C, V, Z and Y (compared as chars; Equals(int) on a char is always false)
return false;
if (e.KeyChar == '\b')//backspace
return false;
char[] charArray = Path.GetInvalidFileNameChars();
if (charArray.Contains(e.KeyChar))
return true;//Stop the character from being entered into the control since it is not valid in a file name
else
return false;
}
From my older projects, I've found this solution, which has been working perfectly for over 2 years. I replace illegal chars with "!" and then collapse any doubled "!!" into one. Use your own replacement char.
public string GetSafeFilename(string filename)
{
string res = string.Join("!", filename.Split(Path.GetInvalidFileNameChars()));
while (res.IndexOf("!!") >= 0)
res = res.Replace("!!", "!");
return res;
}
I find using this to be quick and easy to understand:
<Extension()>
Public Function MakeSafeFileName(FileName As String) As String
Return FileName.Where(Function(x) Not IO.Path.GetInvalidFileNameChars.Contains(x)).ToArray
End Function
This works because a string can be enumerated as a sequence of chars, and a char array can be converted back into a string.
Many answers suggest using Path.GetInvalidFileNameChars(), which seems like a bad solution to me. I encourage you to use whitelisting instead of blacklisting, because hackers will eventually find a way to bypass a blacklist.
Here is an example of code you could use :
string whitelist = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.";
foreach (char c in filename)
{
if (!whitelist.Contains(c))
{
filename = filename.Replace(c, '-');
}
}

CSV Parsing with double quotes

I am trying to use C# to parse CSV. I used a regular expression to find "," and read the string if my header count was equal to my match count.
Now this will not work if I have a value like:
"a",""b","x","y"","c"
then my output is:
'a'
'"b'
'x'
'y"'
'c'
but what I want is:
'a'
'"b","x","y"'
'c'
Is there a regex or any other logic I can use for this?
CSV can get trickier than you might think once you deal with multi-line values, quoting, different delimiters* and so on. Perhaps consider a pre-rolled answer? I use this, and it works very well.
*=remember that some locales use [tab] as the C in CSV...
CSV is a great example for code reuse: no matter which CSV parser you choose, don't choose your own. See "Stop Rolling Your Own CSV Parser".
I would use FileHelpers if I were you. Regular Expressions are fine but hard to read, especially if you go back, after a while, for a quick fix.
Just for the sake of exercising my mind, here is a quick and dirty working C# procedure:
public static List<string> SplitCSV(string line)
{
if (string.IsNullOrEmpty(line))
throw new ArgumentException();
List<string> result = new List<string>();
bool inQuote = false;
StringBuilder val = new StringBuilder();
// parse line
foreach (var t in line.Split(','))
{
int count = t.Count(c => c == '"');
if (count > 2 && !inQuote)
{
inQuote = true;
val.Append(t);
val.Append(',');
continue;
}
if (count > 2 && inQuote)
{
inQuote = false;
val.Append(t);
result.Add(val.ToString());
val.Clear(); // reset the buffer so the next quoted group starts empty
continue;
}
if (count == 2 && !inQuote)
{
result.Add(t);
continue;
}
if (count == 2 && inQuote)
{
val.Append(t);
val.Append(',');
continue;
}
}
// remove quotation
for (int i = 0; i < result.Count; i++)
{
string t = result[i];
result[i] = t.Substring(1, t.Length - 2);
}
return result;
}
There's an oft-quoted saying:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. (Jamie Zawinski)
Given that there's no official standard for CSV files (instead there are a large number of slightly incompatible styles), you need to make sure that what you implement suits the files you will be receiving. No point in implementing anything fancier than what you need - and I'm pretty sure you don't need Regular Expressions.
Here's my stab at a simple method to extract the terms - basically, it loops through the line looking for commas, keeping track of whether the current index is within a string or not:
public IEnumerable<string> SplitCSV(string line)
{
int index = 0;
int start = 0;
bool inString = false;
foreach (char c in line)
{
switch (c)
{
case '"':
inString = !inString;
break;
case ',':
if (!inString)
{
yield return line.Substring(start, index - start);
start = index + 1;
}
break;
}
index++;
}
// Always emit the final field, even when it is empty (for example after a trailing comma)
yield return line.Substring(start, index - start);
}
Standard caveat - untested code, there may be off-by-one errors.
Limitations
The quotes around a value aren't removed automatically.
To do this, add a check just before the yield return statement near the end.
Single quotes aren't supported in the same way as double quotes
You could add a separate boolean inSingleQuotedString, renaming the existing boolean to inDoubleQuotedString and treating both the same way. (You can't make the existing boolean do double work because you need the string to end with the same quote that started it.)
Whitespace isn't automatically removed
Some tools introduce whitespace around the commas in CSV files to "pretty" the file; it then becomes difficult to tell intentional whitespace from formatting whitespace.
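For the first limitation, here is a minimal sketch of the kind of check that could be applied to each value before it is returned. The class name CsvFieldCleanup and the helper Unquote are hypothetical names made up for illustration; they are not part of the method above. It only strips one pair of enclosing double quotes and does not touch whitespace or escaped quotes.

```csharp
using System;

public static class CsvFieldCleanup
{
    // Hypothetical helper: removes one pair of enclosing double quotes, if present.
    // Values that are not wrapped in quotes are returned unchanged.
    public static string Unquote(string field)
    {
        if (field.Length >= 2 && field[0] == '"' && field[field.Length - 1] == '"')
            return field.Substring(1, field.Length - 2);
        return field;
    }
}
```

You would call it on the substring at each yield return, e.g. yield return CsvFieldCleanup.Unquote(line.Substring(start, index - start)).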
In order to have a parseable CSV file, any double quotes inside a value need to be properly escaped somehow. The two standard ways to do this are by representing a double quote either as two double quotes back to back, or a backslash double quote. That is one of the following two forms:
""
\"
In the second form your initial string would look like this:
"a","\"b\",\"x\",\"y\"","c"
If your input string does not follow some rigorous format like this, then you have very little chance of successfully parsing it in an automated environment.
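To illustrate the first form (two double quotes back to back), here is a rough sketch of a character-by-character splitter. The names EscapedCsv and SplitLine are my own, and it assumes a single line with comma separators and no backslash escapes, so treat it as a starting point rather than a complete CSV parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class EscapedCsv
{
    // Splits one CSV line, treating "" inside a quoted value as a literal quote.
    public static List<string> SplitLine(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c != '"')
                {
                    current.Append(c);       // ordinary character, commas included
                }
                else if (i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"');     // escaped quote
                    i++;                     // skip the second quote of the pair
                }
                else
                {
                    inQuotes = false;        // closing quote
                }
            }
            else if (c == '"')
            {
                inQuotes = true;             // opening quote
            }
            else if (c == ',')
            {
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }
        fields.Add(current.ToString());      // last field
        return fields;
    }
}
```

With the question's data escaped in this form, "a","""b"",""x"",""y""","c", the middle field comes back with its inner quotes and commas intact.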
If all your values are guaranteed to be in quotes, look for values, not for commas:
("".*?""|"[^"]*")
This relies on the regex engine trying the alternatives from left to right: it looks for double-quoted values first and only then falls back to normally quoted values.
If you don't want the enclosing quote to be part of the match, use:
"(".*?"|[^"]*)"
and go for the value in match group 1.
As I said, the prerequisite for this to work is well-formed input with guaranteed quotes or double quotes around each value. Empty values must be quoted as well! A nice side effect is that it does not care about the separator character: commas, tabs, semicolons, spaces, you name it. All will work.
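For what it's worth, here is roughly how the second pattern could be used from C#. The class and method names (QuotedValueRegex, ExtractValues) are placeholders I picked for the example, and the pattern is written as a regular string literal with escaped quotes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuotedValueRegex
{
    // The pattern "(".*?"|[^"]*)" from the text, as a regular C# string literal.
    private static readonly Regex ValuePattern = new Regex("\"(\".*?\"|[^\"]*)\"");

    // Returns the content of match group 1 for every quoted value in the line.
    public static List<string> ExtractValues(string line)
    {
        var values = new List<string>();
        foreach (Match m in ValuePattern.Matches(line))
            values.Add(m.Groups[1].Value);
        return values;
    }
}
```

On the line from the question this gives a, then "b","x","y", then c. I only tried it on non-empty values, so check how adjacent empty values behave against your own data before relying on it.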
FileHelpers supports multiline fields.
You could parse files like these:
a,"line 1
line 2
line 3"
b,"line 1
line 2
line 3"
Here is the datatype declaration:
[DelimitedRecord(",")]
public class MyRecord
{
public string field1;
[FieldQuoted('"', QuoteMode.OptionalForRead, MultilineMode.AllowForRead)]
public string field2;
}
Here is the usage:
static void Main()
{
// The generic engine returns MyRecord[] directly; the non-generic ReadFile returns object[]
var engine = new FileHelperEngine<MyRecord>();
MyRecord[] res = engine.ReadFile("file.csv");
}
Try CsvHelper (a library I maintain) or FastCsvReader. Both work well. CsvHelper does writing also. Like everyone else has been saying, don't roll your own. :P
FileHelpers for .Net is your friend.
See the link "Regex fun with CSV" at:
http://snippets.dzone.com/posts/show/4430
The Lumenworks CSV parser (open source, free but needs a codeproject login) is by far the best one I've used. It'll save you having to write the regex and is intuitive to use.
Well, I'm no regex wiz, but I'm certain they have an answer for this.
Procedurally it's going through letter by letter. Set a variable, say dontMatch, to FALSE.
Each time you run into a quote toggle dontMatch.
Each time you run into a comma, check dontMatch. If it's TRUE, ignore the comma. If it's FALSE, split at the comma.
This works for the example you give, but the logic you use for quotation marks is fundamentally faulty - you must escape them or use another delimiter (single quotes, for instance) to set major quotations apart from minor quotations.
For instance,
"a", ""b", ""c", "d"", "e""
will yield bad results.
This can be fixed with another patch. Rather than simply keeping a true/false flag, you have to match quotes.
To match quotes you have to know what was last seen, which gets into pretty deep parsing territory. You'll probably, at that point, want to make sure your language is designed well, and if it is you can use a compiler tool to create a parser for you.
-Adam
I have just tried your regular expression in my code. It works fine for formatted text with quotes,
but I am wondering if we can parse the value below with a regex:
"First_Bat7679",""NAME","ENAME","FILE"","","","From: "DDD,_Ala%as"#sib.com"
I am looking for result as:
'First_Bat7679'
'"NAME","ENAME","FILE"'
''
''
'From: "DDD,_Ala%as"#sib.com'
Thanks
