RegularExpressionAttribute Equivalent for SymbolUtilityServices.ValidateSymbolName

In AutoCAD there is a utility for determining if a string is valid for a symbol name, i.e. a Block or Layer name for instance. This utility is:
try
{
    // Validate the provided symbol table name
    SymbolUtilityServices.ValidateSymbolName(s, false);
    System.Windows.Forms.MessageBox.Show(s + " is a valid name.");
}
catch
{
    // An exception has been thrown, indicating that
    // the name is invalid
    System.Windows.Forms.MessageBox.Show(s + " is an invalid name.");
}
where "s" is the string you are testing.
See How to check if a given string is a valid name for an item in a symbol table?
Since this tool throws an exception if the name is out of compliance, I would much rather use a RegularExpressionAttribute to do the same, something like:
[RegularExpressionAttribute(@"^[a-zA-Z]+$", ErrorMessage = "Special characters not allowed")]
But here lies my problem: I am not well versed in Regex. So what would the expression be to disallow these characters:
\<>/?":;*|,=`
(spaces allowed)
Your thoughts and help are appreciated.
Matt

This expression:
[RegularExpressionAttribute(@"^[a-zA-Z \d_-]+$", ErrorMessage = "Certain special characters not allowed")]
does seem to do the trick. I put this together, but I feel like it doesn't explicitly disallow the characters; instead, it only allows certain characters.
If there is a more concise answer I will accept it.
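For what it's worth, a negated character class expresses "disallow exactly these characters" directly. The sketch below is only an illustration: it covers just the characters listed in the question, the names SymbolNamePattern and LayerModel are made up for the example, and ValidateSymbolName may enforce rules beyond this character list, so this is not a guaranteed equivalent.

```csharp
using System;
using System.ComponentModel.DataAnnotations;
using System.Text.RegularExpressions;

// Hypothetical helper holding the pattern so it can be shared
// between the attribute and ad-hoc checks.
public static class SymbolNamePattern
{
    // One or more characters, none of which is \ < > / ? " : ; * | , = or `
    // In a verbatim string, \\ is a regex-escaped backslash and "" is one quote.
    public const string Pattern = @"^[^\\<>/?"":;*|,=`]+$";

    public static bool IsAllowed(string name)
    {
        return name != null && Regex.IsMatch(name, Pattern);
    }
}

// Hypothetical model class showing the attribute in use.
public class LayerModel
{
    [RegularExpression(SymbolNamePattern.Pattern, ErrorMessage = "Special characters not allowed")]
    public string Name { get; set; }
}
```

With this pattern, SymbolNamePattern.IsAllowed("Layer 1") is true (spaces pass), while a name containing any of the listed characters, such as "bad:name", is rejected.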

Related

Validate email address against invalid characters

In validating email addresses I have tried using both the EmailAddressAttribute class from System.ComponentModel.DataAnnotations:
[EmailAddress(ErrorMessage = "Invalid Email Address")]
public string Email { get; set; }
and the MailAddress class from System.Net.Mail by doing:
bool IsValidEmail(string email)
{
    try
    {
        var addr = new System.Net.Mail.MailAddress(email);
        return addr.Address == email;
    }
    catch
    {
        return false;
    }
}
as suggested in C# code to validate email address. Both methods work in principle: they catch invalid email addresses such as user@, which does not fulfil the format user@host.
My problem is that neither method detects invalid characters in the user field, such as æ, ø, or å (e.g. åge@gmail.com). Is there any reason why such characters do not return a validation error? And does anybody have an elegant solution for incorporating a validation for invalid characters in the user field?

Those characters are not invalid. Unusual, but not invalid. The question you linked even contains an explanation why you shouldn't care.
Full use of electronic mail throughout the world requires that
(subject to other constraints) people be able to use close variations
on their own names (written correctly in their own languages and
scripts) as mailbox names in email addresses.
- RFC 6530, 2012

The characters you mentioned (ø, å or åge@gmail.com) are not invalid. Consider an example: when someone uses a foreign language for their email ID (French, German, etc.), some Unicode characters are possible. Yet EmailAddressAttribute blocks some of the unusual characters.
You can use international characters above U+007F, encoded as UTF-8.
space and "(),:;<>@[] characters are allowed with restrictions (they are only allowed inside a quoted string, and a backslash or double-quote must be preceded by a backslash)
special characters !#$%&'*+-/=?^_`{|}~
Regex to validate this: Link
^(([^<>()[\]\.,;:\s@\"]+(\.[^<>()[\]\.,;:\s@\"]+)*)|(\".+\"))@(([^<>()[\]\.,;:\s@\"]+\.)+[^<>()[\]\.,;:\s@\"]{2,})
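If a downstream system really cannot handle internationalized addresses, that restriction is a business rule rather than a validity rule, and it can be checked separately. A minimal sketch, assuming the rule is "the part before the last @ must be plain ASCII"; the class and method names are made up for illustration, and the check is meant to run alongside the existing validation, not replace it:

```csharp
using System;

// Hypothetical helper: not part of EmailAddressAttribute or MailAddress.
public static class AsciiLocalPartCheck
{
    // True when the text before the last '@' is non-empty and uses only
    // characters in the ASCII range (U+0000 to U+007F).
    // It does not validate the host part at all.
    public static bool HasAsciiLocalPart(string email)
    {
        if (email == null) return false;
        int at = email.LastIndexOf('@');
        if (at <= 0) return false;
        for (int i = 0; i < at; i++)
        {
            if (email[i] > '\u007F') return false;
        }
        return true;
    }
}
```

For example, HasAsciiLocalPart("age@gmail.com") is true, while HasAsciiLocalPart("åge@gmail.com") is false.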

Strange behaviour of String.Format when (mis-)using placeholders

When I learned about the String.Format function, I made the mistake of thinking that it's acceptable to name the placeholders after the colon, so I wrote code like this:
String.Format("A message: '{0:message}'", "My message");
//output: "A message: 'My message'"
I just realized that the string after the colon defines the format of the placeholder and may not be used to add a comment as I did.
But apparently, the string after the colon is output in place of the placeholder's value if:
I fill the placeholder with an integer and
I use an unrecognized formatting string after the colon
But this doesn't explain to me why the string after the colon is output instead of the value when I provide an integer.
Some examples:
//Works for strings
String.Format("My number is {0:number}!", "10")
//output: "My number is 10!"
//Works without formatting string
String.Format("My number is {0}!", 10)
//output: "My number is 10!"
//Works with recognized formatting string
String.Format("My number is {0:d}!", 10)
//output: "My number is 10!"
//Does not work with unrecognized formatting string
String.Format("My number is {0:number}!", 10)
//output: "My number is number!"
Why is there a difference between the handling of strings and integers? And why is the fallback to output the formatting string instead of the given value?

Just review the MSDN page about composite formatting for clarity.
A basic synopsis, the format item syntax is:
{ index[,alignment][:formatString]}
So what appears after the : colon is the formatString. Look at the "Format String Component" section of the MSDN page for what kind of format strings are predefined. You will not see System.String mentioned in that list. Which is no great surprise, a string is already "formatted" and will only ever appear in the output as-is.
Composite formatting is pretty lenient to mistakes, it won't throw an exception when you specify an illegal format string. That the one you used isn't legal is already pretty evident from the output you get. And most of all, the scheme is extensible. You can actually make a :message format string legal, a class can implement the ICustomFormatter interface to implement its own custom formatting. Which of course isn't going to happen on System.String, you cannot modify that class.
So this works as expected. If you don't get the output you expected, this is pretty easy to debug: you've just got two mistakes to consider. The debugger eliminates one (wrong argument), your eyes eliminate the other.
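To illustrate the extensibility point, here is a rough sketch of an ICustomFormatter that gives the :message format string a meaning. The class name LabelFormatter and the bracket output are invented for the example; only IFormatProvider, ICustomFormatter and the String.Format overload taking a provider come from the framework.

```csharp
using System;
using System.Globalization;

// Hypothetical formatter: treats the "message" format string as
// "wrap the argument in square brackets".
public sealed class LabelFormatter : IFormatProvider, ICustomFormatter
{
    // String.Format asks the provider for an ICustomFormatter.
    public object GetFormat(Type formatType)
    {
        return formatType == typeof(ICustomFormatter) ? this : null;
    }

    public string Format(string format, object arg, IFormatProvider formatProvider)
    {
        if (format == "message")
        {
            return "[" + arg + "]";
        }

        // Anything else: fall back to the argument's own formatting.
        IFormattable formattable = arg as IFormattable;
        if (formattable != null)
        {
            return formattable.ToString(format, CultureInfo.CurrentCulture);
        }
        return arg == null ? string.Empty : arg.ToString();
    }
}
```

Passing the formatter as the first argument, String.Format(new LabelFormatter(), "A message: '{0:message}'", "My message") yields "A message: '[My message]'", even though the argument is a plain string.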

String.Format article on MSDN has following description:
A format item has this syntax: { index[,alignment][ :formatString] }
...
formatString Optional.
A string that specifies the format of the
corresponding argument's result string. If you omit formatString, the
corresponding argument's parameterless ToString method is called to
produce its string representation. If you specify formatString, the
argument referenced by the format item must implement the IFormattable
interface.
If we directly format the value using the IFormattable we will have the same result:
String garbageFormatted = (10 as IFormattable).ToString("garbage in place of int",
CultureInfo.CurrentCulture.NumberFormat);
Console.WriteLine(garbageFormatted); // Writes the "garbage in place of int"
So this looks like a "garbage in, garbage out" problem in the implementation of the IFormattable interface on the Int32 type (and possibly on other types as well). The String class does not implement IFormattable, so any format specifier is ignored and the plain ToString() method is called instead.
Also:
Ildasm shows that Int32.ToString(String, IFormatProvider) internally calls
string System.Number::FormatInt32(int32,
string,
class System.Globalization.NumberFormatInfo)
But it is an internalcall method (extern, implemented somewhere in native code), so Ildasm is of no use if we want to determine the source of the problem.
EDIT - CULPRIT:
After reading How to see code of method which marked as MethodImplOptions.InternalCall? I used the source code from the Shared Source Common Language Infrastructure 2.0 Release (it is .NET 2.0, but nonetheless) in an attempt to find the culprit.
Code for the Number.FormatInt32 is located in the ...\sscli20\clr\src\vm\comnumber.cpp file.
The culprit could be deduced from the default section of the format switch statement of the FCIMPL3(Object*, COMNumber::FormatInt32, INT32 value, StringObject* formatUNSAFE, NumberFormatInfo* numfmtUNSAFE):
default:
NUMBER number;
Int32ToNumber(value, &number);
if (fmt != 0) {
gc.refRetString = NumberToString(&number, fmt, digits, gc.refNumFmt);
break;
}
gc.refRetString = NumberToStringFormat(&number, gc.refFormat, gc.refNumFmt);
break;
The fmt var is 0, so the NumberToStringFormat(&number, gc.refFormat, gc.refNumFmt); is being called.
It leads us to nothing else than to the second switch statement default section in the NumberToStringFormat method, that is located in the loop that enumerates every format string character. It is very simple:
default:
*dst++ = ch;
It just plain copies every character from the format string into the output array; that's how the format string ends up repeated in the output.
From one point of view, this lets you use garbage format strings that output nothing useful, but from another point of view it lets you use something like:
String garbageFormatted = (1234 as IFormattable).ToString("0 thousands and ### in thousand",
CultureInfo.CurrentCulture.NumberFormat);
Console.WriteLine(garbageFormatted);
// Writes the "1 thousands and 234 in thousand"
that can be handy in some situations.

Interesting behavior indeed, BUT NOT unaccounted for.
Your last example works with
String.Format("My number is {0:n}!", 10)
but reverts to the observed behavior with
String.Format("My number is {0:nu}!", 10)
This prompts a look at the Standard Numeric Format Specifier article on MSDN, where you can read
Standard numeric format strings are used to format common numeric
types. A standard numeric format string takes the form Axx, where:
A is a single alphabetic character called the format specifier. Any
numeric format string that contains more than one alphabetic
character, including white space, is interpreted as a custom numeric
format string. For more information, see Custom Numeric Format
Strings.
The same article explains: if you have a SINGLE letter that is not recognized, you get an exception.
Indeed
String.Format("My number is {0:K}!", 10)
throws the FormatException as explained.
Now looking in the Custom Numeric Format Strings chapter you will find a table of eligible letters and their possible mixings, but at the end of the table you could read
Other
All other characters
The character is copied to the result string unchanged.
So I think that you have created a format string that cannot in any way print that number because there is no valid format specifier where the number 10 should be 'formatted'.
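The cases discussed above can be tried side by side with a small helper. This is just a sketch: FormatDemo and Try are made-up names, and the invariant culture is used so the results do not depend on the machine's regional settings.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for experimenting with format strings.
public static class FormatDemo
{
    // Returns the formatted text, or a marker when String.Format throws.
    public static string Try(string format, object value)
    {
        try
        {
            return String.Format(CultureInfo.InvariantCulture, format, value);
        }
        catch (FormatException)
        {
            return "<FormatException>";
        }
    }
}
```

FormatDemo.Try("{0:d}", 10) gives "10" (a recognized standard specifier), FormatDemo.Try("{0:K}", 10) gives the exception marker (a single unrecognized letter), FormatDemo.Try("{0:number}", 10) gives "number" (a custom format string without any digit placeholder), and FormatDemo.Try("{0:number}", "10") gives "10" (String ignores the format string).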

No, it's not acceptable to place anything you like after the colon. Putting anything other than a recognized format specifier there is likely to result in either an exception or unpredictable behaviour, as you've demonstrated. I don't think you can expect string.Format to behave consistently when you're passing it arguments which are completely inconsistent with the documented formatting types.

How can I generate a safe class name from a file name?

I'm trying to produce some dynamically compiled code with the Razor engine, and I want to name the generated classes according to their source file names to help understand where a piece of generated code comes from.
For example, I would expect the file C:\source\Foo.cs to be compiled with the name Foo.
Given that I have the path to the source file being compiled, is there a way to generate a valid C# identifier based on the file name?

According to the C# spec, the following rules must be adhered to when creating identifiers:
An identifier must start with a letter or an underscore
After the first character, it may contain numbers, letters, connectors, etc
If the identifier is a keyword, it must be prepended with “@”
This helper will satisfy those conditions:
private static string GenerateClassName(string value)
{
    string className = CultureInfo.CurrentCulture.TextInfo.ToTitleCase(value);
    bool isValid = Microsoft.CSharp.CSharpCodeProvider.CreateProvider("C#").IsValidIdentifier(className);
    if (!isValid)
    {
        // File name contains invalid chars, remove them
        Regex regex = new Regex(@"[^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Nl}\p{Mn}\p{Mc}\p{Cf}\p{Pc}\p{Lm}]");
        className = regex.Replace(className, "");
        // Class name is empty or doesn't begin with a letter, insert an underscore
        if (className.Length == 0 || !char.IsLetter(className, 0))
        {
            className = className.Insert(0, "_");
        }
    }
    return className.Replace(" ", string.Empty);
}
It first converts the file name to title case (personal preference), then uses IsValidIdentifier to determine whether the file name is already valid as a class name.
If not, it removes all invalid characters based on the Unicode character classes. It then checks whether the result starts with a letter; if it does not, it prepends an _ to fix it.
Finally, it removes any remaining whitespace.

First, you need to extract the File-Name, for example with:
Path.GetFileNameWithoutExtension
Then you have to follow all the rules that apply to a C# class name.
For example
Starting with a letter or _
I would remove all characters other than _, a-z and 0-9
That should be all!
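Those steps could look roughly like the following. It is a sketch under a few assumptions: the names ClassNameSketch and FromPath are invented, upper-case A-Z is kept as well as a-z, and an underscore is prepended when the result would otherwise be empty or start with a digit.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper following the steps described above.
public static class ClassNameSketch
{
    public static string FromPath(string path)
    {
        // Step 1: file name without directory and extension.
        string baseName = Path.GetFileNameWithoutExtension(path);

        // Step 2: keep only _, ASCII letters and digits.
        var sb = new StringBuilder();
        foreach (char c in baseName)
        {
            bool keep = c == '_'
                || (c >= 'a' && c <= 'z')
                || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9');
            if (keep) sb.Append(c);
        }

        // Step 3: an identifier must start with a letter or underscore.
        if (sb.Length == 0 || char.IsDigit(sb[0]))
        {
            sb.Insert(0, '_');
        }
        return sb.ToString();
    }
}
```

Note that Path treats \ as a directory separator only on Windows, so a path like C:\source\Foo.cs is split as expected there, while on other platforms / is the separator. The sketch also does not check for C# keywords.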

did you look at the codedom - http://msdn.microsoft.com/en-us/library/ms404245(v=vs.110).aspx ?

Take the path, replace the invalid characters like \ with let's say _ and you're done.
If you prefer shorter names, you could take the path, transform it to lowercase and take a hash value.
Some code sample:
var className = pathIncludingFilename.ToLowerSinceCasingIsNotRelevant().SomeHashFunctionLikeSha1OrPartOfIt() + filename.RemoveInvalidCharactersLikeWhitespace();
The result may look like this:
123a3b6b22foo
The hash should ensure unique names, the filename makes it easier to correlate.
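As a rough illustration of the hash idea, the pseudocode above might be fleshed out like this. Everything named here (HashedClassName, FromPath, the choice of SHA-1 and of four hash bytes) is an assumption made for the example. One difference from the sample result above: a C# identifier cannot start with a digit, so the sketch puts an underscore in front of the hash.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper: underscore + short hash of the lower-cased path
// + underscore + cleaned file name.
public static class HashedClassName
{
    public static string FromPath(string path)
    {
        byte[] hash;
        using (SHA1 sha1 = SHA1.Create())
        {
            // Lower-case first so casing differences map to the same hash.
            hash = sha1.ComputeHash(Encoding.UTF8.GetBytes(path.ToLowerInvariant()));
        }

        var sb = new StringBuilder("_");
        for (int i = 0; i < 4; i++)
        {
            sb.Append(hash[i].ToString("x2")); // two hex digits per byte
        }
        sb.Append('_');

        // Keep only letters, digits and underscores from the file name.
        foreach (char c in Path.GetFileNameWithoutExtension(path))
        {
            if (char.IsLetterOrDigit(c) || c == '_') sb.Append(c);
        }
        return sb.ToString();
    }
}
```

Using only part of the hash keeps names short but makes collisions more likely, so how many bytes to keep is a trade-off.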

How can I make this regex match correctly?

Given this regex:
^((https?|ftp):(\/{2}))?(((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))|(((([a-zA-Z0-9]+)(\.)*?))(\.)([a-z]{2}
|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum){1})
Reformatted for readability:
@"^((https?|ftp):(\/{2}))?" + // http://, https://, ftp:// - Protocol Optional
@"(" + // Begin URL payload format section
@"((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)" + // IPv4 Address support
@")|("+ // Delimit supported payload types
@"((([a-zA-Z0-9]+)(\.)*?))(\.)([a-z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum){1}" + // FQDNs
@")"; // End URL payload format section
How can I make it fail (i.e. not match) on this "fail" test case?
http://www.google
As I am specifying {1} on the TLD section, I would think it would fail without the extension. Am I wrong?
Edit: These are my PASS conditions:
"http://www.zi255.com?Req=Post&PID=4",
"http://www.zi255.com?Req=Post&ID=4",
"http://www.zi255.com/?Req=Post&PID=4",
"http://www.zi255.com?Req=Post&PostID=4",
"http://www.zi255.com/?Req=Post&ID=4"
"http://www.zi255.com?Req=Post&Post=4",
"http://www.zi255.com?Req=Post&Entry=4",
"http://www.zi255.com?PID=4"
"http://www.zi255.com/Post.aspx?Req=Post&ID=4",
"http://www.zi255.com/Post.aspx?Req=Post&PID=4",
"http://www.zi255.com/Post.aspx?Req=Post&Post=4",
"http://www.zi255.com/Post.aspx?Req=Post&Title=Random%20Post%20Name"
"http://www.zi255.com/?Req=Post&Title=Random%20Post%20Name",
"http://www.zi255.com?Req=Post&Title=Random%20Post%20Name",
"http://www.zi255.com?Req=Post&PostID=4",
"http://www.zi255.com?Req=Post&Post=4",
"http://www.zi255.com?Req=Post&Entry=4",
"http://www.zi255.com?PID=4"
"http://www.zi255.com",
"http://www.damnednice.com"
These are my FAIL conditions:
"http://.com",
"http://.com/",
"http:/www.google.com",
"http:/www.google.com/",
"http://www.google",
"http://www.googlecom",
"http://www.google.c",
".com",
"https://www..."

I'll throw out an alternative suggestion. You may want to use a combination of the parsing of the built-in System.Uri class and a couple targeted regexes (or simple string checks when appropriate).
Example:
string uriString = "...";
Uri uri;
if (!Uri.TryCreate(uriString, UriKind.Absolute, out uri))
{
    // Uri is totally invalid!
}
else
{
    // validate the scheme
    if (!uri.Scheme.Equals("http", StringComparison.OrdinalIgnoreCase))
    {
        // not http!
    }
    // validate the authority ('www.blah.com:1234' portion),
    // e.g. with a targeted regex or string check on uri.Authority
    // ...
}

Sometimes, one catch-all regex is not the best solution, however tempting. While debugging this regex is feasible (see Greg Hewgill's answer), consider doing a couple of tests for different categories of problems, e.g. one test for numerical addresses and one test for named addresses.

You need to force your regex to match up until the end of the string. Add a $ at the very end of it. Otherwise, your regex is probably just matching http://, or something else shorter than your whole string.
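One way this could be applied is sketched below. Two things go beyond simply appending $: in the original pattern the top-level | splits the expression in two, so ^ only applies to the IPv4 branch, and a bare $ after the TLD would reject the PASS cases that carry a path or query string. The sketch therefore groups both branches between the anchors and allows an optional tail starting with / or ?. The tail rule, the simplified FQDN part and the names UrlPatternSketch and IsMatch are assumptions made for the example, not the asker's final regex.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical, anchored variant of the pattern from the question.
public static class UrlPatternSketch
{
    private const string Octet = @"(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)";
    private const string Tld =
        @"([a-z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum)";

    public static readonly Regex Pattern = new Regex(
        @"^((https?|ftp)://)?" +                    // optional protocol
        "(" +
            "(" + Octet + @"\.){3}" + Octet +       // IPv4 address
            "|" +
            @"([a-zA-Z0-9]+\.)+" + Tld +            // dotted labels + TLD
        ")" +
        @"([/?].*)?$");                             // optional path/query, then end

    public static bool IsMatch(string url)
    {
        return Pattern.IsMatch(url);
    }
}
```

With both anchors in place, http://www.google no longer matches, because nothing after the last dot is an accepted TLD followed by the end of the string, a / or a ?.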

The "validate a url" problem has been solved* numerous times. I suggest you use the System.Uri class, it validates more cases than you can shake a stick at.
The code Uri uri = new Uri("http://whatever"); throws a UriFormatException if it fails validation. That is probably what you'd want.
*) Or kind of solved. It's actually pretty tricky to define what is a valid url.

It's all about definitions. A "valid URL" should provide you with an IP address when you do a DNS lookup. You should be able to connect to that IP address and, when a request is sent out, get a reply in the form of HTML that you can use.
So what we are looking for is a "valid URL format", and that is where System.Uri comes in very handy. BUT, if the URL is hidden in a large piece of text, you would first want to find something that could pass as a valid URL format.
The thing that distinguishes a URL from any other readable text is a dot not followed by whitespace. "123.com" could validate as a real URL.
Use the regex
[a-z_\.\-0-9]+\.[a-z]+[^ ]*
to find any possible valid URL in a text, then do a System.Uri check to see whether it has a valid URL format, and then do a lookup. Only when the lookup gives you a result do you know the URL is valid.

Regular expressions in C# for file name validation

What is a good regular expression that can validate a text string to make sure it is a valid Windows filename? (AKA not have \/:*?"<>| characters).
I'd like to use it like the following:
// Return true if string is invalid.
if (Regex.IsMatch(szFileName, "<your regex string>"))
{
    // Tell user to reformat their filename.
}

As answered already, GetInvalidFileNameChars should do it for you, and you don't even need the overhead of regular expressions:
if (proposedFilename.IndexOfAny(System.IO.Path.GetInvalidFileNameChars()) != -1)
{
    MessageBox.Show("The filename is invalid");
    return;
}

This isn't as simple as just checking whether the file name contains any of System.IO.Path.GetInvalidFileNameChars (as mentioned in a couple of other answers already).
For example, what if somebody enters a name that contains no invalid chars but is 300 characters long (i.e. greater than MAX_PATH)? This won't work with any of the .NET file APIs, and has only limited support in the rest of Windows using the \\?\ path syntax. You need context as to how long the rest of the path is to determine how long the file name can be. You can find more information about this type of thing here.
Ultimately all your checks can reliably do is prove that a file name is not valid, or give you a reasonable estimate as to whether it is valid. It's virtually impossible to prove that the file name is valid without actually trying to use it. (And even then you have issues like what if it already exists? It may be a valid file name, but is it valid in your scenario to have a duplicate name?)
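In that spirit, a check can be written so that it only ever reports a definite problem. The sketch below is illustrative: FileNameCheck, FindProblem and the maxLength parameter are invented names, and the caller is assumed to know how much room the rest of the path leaves for the file name.

```csharp
using System;
using System.IO;

// Hypothetical helper: reports why a name is definitely unusable,
// or null when no problem was found (which is not proof of validity).
public static class FileNameCheck
{
    public static string FindProblem(string fileName, int maxLength)
    {
        if (string.IsNullOrEmpty(fileName))
        {
            return "name is empty";
        }
        if (fileName.IndexOfAny(Path.GetInvalidFileNameChars()) != -1)
        {
            return "name contains an invalid character";
        }
        if (fileName.Length > maxLength)
        {
            return "name is too long for the target location";
        }
        return null;
    }
}
```

A null result only means "no problem found so far"; actually creating or opening the file remains the final test. The set returned by GetInvalidFileNameChars also depends on the platform the code runs on.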

Why not use the System.IO.FileInfo class? Together with the DirectoryInfo class, it gives you a set of useful methods.

Path.GetInvalidFileNameChars is not a good way. Try this:
if (@"C:\A.txt".IndexOfAny(System.IO.Path.GetInvalidFileNameChars()) != -1)
{
    MessageBox.Show("The filename is invalid");
    return;
}
