Regular expression for valid filename - c#

I have already gone through some questions on StackOverflow regarding this, but nothing helped much in my case.
I want to restrict the user to providing a filename that contains only alphanumeric characters, -, _, . and space.
I'm not good at regular expressions, and so far I have come up with this: ^[a-zA-Z0-9.-_]$. Can somebody help me?

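This is the correct expression:
string regex = @"^[\w\-. ]+$";
\w is equivalent to [0-9a-zA-Z_] when RegexOptions.ECMAScript is used; by default in .NET it also matches Unicode letters and digits.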
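As a minimal sketch of how the expression above behaves in .NET, the following compiles it twice: once with default options and once with RegexOptions.ECMAScript, which narrows \w to ASCII. The class and field names are made up for illustration.

```csharp
using System.Text.RegularExpressions;

// Hypothetical holder for the two variants of the same pattern.
static class FileNamePattern
{
    // Default .NET behaviour: \w also matches Unicode letters and digits.
    public static readonly Regex UnicodeAware = new Regex(@"^[\w\-. ]+$");

    // RegexOptions.ECMAScript restricts \w to [a-zA-Z0-9_].
    public static readonly Regex AsciiOnly =
        new Regex(@"^[\w\-. ]+$", RegexOptions.ECMAScript);
}
```

With these, FileNamePattern.UnicodeAware.IsMatch("æøå.txt") succeeds while FileNamePattern.AsciiOnly.IsMatch("æøå.txt") does not; both accept "my file-1.txt".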

To validate a file name I would suggest using the function provided by C# rather than a regex:
if (filename.IndexOfAny(System.IO.Path.GetInvalidFileNameChars()) != -1)
{
    // filename contains at least one character that is invalid in a file name
}
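The check above can be wrapped in a small helper, sketched here under the assumption that an empty or whitespace-only name should also be rejected. The helper name is hypothetical. Note that Path.GetInvalidFileNameChars() returns a platform-dependent set, and that this only rejects forbidden characters; it does not enforce the stricter whitelist the question asks for.

```csharp
using System.IO;

static class FileNameCheck
{
    // Hypothetical helper: true when the name is non-empty and contains none
    // of the characters the current platform forbids in file names.
    public static bool HasOnlyValidChars(string fileName)
    {
        if (string.IsNullOrWhiteSpace(fileName))
            return false;

        return fileName.IndexOfAny(Path.GetInvalidFileNameChars()) == -1;
    }
}
```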

While what the OP asks is close to what the currently accepted answer uses (^[\w\-. ]+$), there might be others seeing this question who have even more specific constraints.
First off, \w in .NET matches Unicode letters and digits by default, so it will allow a wide range of characters from other alphabets that the OP's requirements exclude.
Secondly, if the file extension is included in the name, this allows all sorts of weird looking, though valid, filenames like file .txt or file...txt.
Thirdly, if you're simply uploading the files to your file system, you might want a blacklist of files and/or extensions like these:
web.config, hosts, .gitignore, httpd.conf, .htaccess
However, that is considerably out of scope for this question; it would require all sorts of info about the setup for good guidance on security issues. I thought I should raise the matter nonetheless.
So for a solution where the user can input the full file name, I would go with something like this:
^[a-zA-Z0-9](?:[a-zA-Z0-9 ._-]*[a-zA-Z0-9])?\.[a-zA-Z0-9_-]+$
It ensures that only the English alphabet is used, that there are no leading or trailing spaces, and that the name ends in a file extension of at least one character with no whitespace.
I've tested this on Regex101, but for future reference, this was my "test-suite":
## THE BELOW SHOULD MATCH
web.config
httpd.conf
test.txt
1.1
my long file name.txt
## THE BELOW SHOULD NOT MATCH - THOUGH VALID
æøå.txt
hosts
.gitignore
.htaccess
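For anyone who wants to repeat that test suite in C#, here is a minimal sketch. The pattern is the one given above; the class and method names are assumptions made for the example.

```csharp
using System.Text.RegularExpressions;

static class StrictFileName
{
    // The pattern from this answer: ASCII letters and digits, inner spaces,
    // dots, underscores and hyphens, and a mandatory extension.
    static readonly Regex Pattern = new Regex(
        @"^[a-zA-Z0-9](?:[a-zA-Z0-9 ._-]*[a-zA-Z0-9])?\.[a-zA-Z0-9_-]+$");

    public static bool IsMatch(string fileName) => Pattern.IsMatch(fileName);
}
```

StrictFileName.IsMatch("my long file name.txt") returns true, while "hosts", ".gitignore", "file .txt" and "file...txt" are all rejected.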

In case someone else needs to validate filenames (including Windows reserved words and such), here's a full expression:
\A(?!(?:COM[0-9]|CON|LPT[0-9]|NUL|PRN|AUX|com[0-9]|con|lpt[0-9]|nul|prn|aux)|[\s\.])[^\\\/:*"?<>|]{1,254}\z
Extended expression (don't allow filenames starting with 2 dots, don't allow filenames ending in dots or whitespace):
\A(?!(?:COM[0-9]|CON|LPT[0-9]|NUL|PRN|AUX|com[0-9]|con|lpt[0-9]|nul|prn|aux)|\s|[\.]{2,})[^\\\/:*"?<>|]{1,254}(?<![\s\.])\z
Edit:
For the interested, here's a link to Windows file naming conventions:
https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247(v=vs.85).aspx
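One thing to watch with the expressions above: the lookahead has no boundary after the reserved word, so it also rejects names that merely begin with one, such as CONTENT.txt. As a hedged alternative, the reserved-name part can be checked outside the regex. The sketch below assumes the same device names the expressions list (CON, PRN, AUX, NUL, COM0-9, LPT0-9) and treats the part before the first dot as the name to compare; the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

static class WindowsReservedNames
{
    // Device names taken from the expressions above; compared case-insensitively.
    static readonly HashSet<string> Reserved = BuildReserved();

    static HashSet<string> BuildReserved()
    {
        var names = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "CON", "PRN", "AUX", "NUL"
        };
        for (int i = 0; i <= 9; i++)
        {
            names.Add("COM" + i);
            names.Add("LPT" + i);
        }
        return names;
    }

    // True when the part before the first dot is a reserved device name.
    public static bool IsReserved(string fileName)
    {
        int dot = fileName.IndexOf('.');
        string baseName = dot < 0 ? fileName : fileName.Substring(0, dot);
        return Reserved.Contains(baseName.TrimEnd(' '));
    }
}
```

With this, "con.txt" and "LPT1" are reported as reserved, while "CONTENT.txt" is not.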

Use this regular expression: ^[a-zA-Z0-9._ -]+$

This is a minor change to Engineer's answer.
string regex = @"^[\w\- ]+[\w\-. ]*$"
This will block ".txt", which isn't valid.
The trouble is, it also blocks "..txt", which is valid.

For full character set (Unicode) use
^[\p{L}0-9_\-.~]+$
or perhaps
^[\p{L}\p{N}_\-.~]+$
would be more accurate if we are talking about Unicode.
I added a '~' simply because I have some files using that character.

I've just created this. It prevents consecutive dots and a dot at the beginning or end. Note that it allows exactly one dot in the whole name, so a name such as archive.tar.gz is rejected.
^([a-zA-Z0-9_]+)\.(?!\.)([a-zA-Z0-9]{1,5})(?<!\.)$

When used in HTML5 via pattern:
<form action="" method="POST">
<fieldset>
<legend>Export Configuration</legend>
<label for="file-name">File Name</label>
<input type="text" required pattern="^[\w\-. ]+$" id="file-name" name="file_name"/>
</fieldset>
<button type="submit">Export Settings</button>
</form>
This accepts only names made of word characters, hyphens, dots and spaces. You can remove required to allow an empty value; the pattern is still checked against any non-empty input.

I may be saying something stupid here, but it seems to me that these answers aren't correct. Firstly, are we talking Linux or Windows here (or another OS)?
Secondly, in Windows it is (I believe) perfectly legitimate to include a "$" in a filename, not to mention Unicode in general. It certainly seems possible.
I tried to get a definitive source on this and ended up at the Wikipedia Filename page. In particular the section "Reserved characters and words" seems relevant: it is, clearly, a list of things which you are NOT allowed to put in.
I'm in the Java world. And I naturally assumed that Apache Commons would have something like validateFilename, maybe in FilenameUtils... but it appears not (if it had done, this would still be potentially useful to C# programmers, as the code is usually pretty easy to understand, and could therefore be translated). I did do an experiment, though, using the method normalize: to my disappointment it allowed perfectly invalid characters (?, etc.) to "pass".
The part of the Wikipedia Filename page referenced above shows that this question depends on the OS you're using, but it should be possible to concoct some simple regex for Linux and Windows at least.
Then I found a Java way (at least):
Path path = java.nio.file.FileSystems.getDefault().getPath("bobb??::mouse.blip");
output:
java.nio.file.InvalidPathException: Illegal char at index 4:
bobb??::mouse.blip
... presumably different FileSystem objects will have different validation rules

Copied from @Engineer for future reference, with the dot escaped (which the most voted answer does not do). Inside a character class an unescaped dot is already literal, so both forms match the same names.
This is the correct expression:
string regex = @"^[\w\-\. ]+$";

Related

C# Regex filter problems

I posted a similar Regex question earlier. It has given me headaches: I have read loads of documentation on how to use regex but still cannot put my finger on it, and I would rather not waste another 6 hours trying to filter what are (I think) simple expressions.
So basically what I want to do is filter all file types with HTML extensions (the '*' stars come from a WinForms TabControl and signify that the file has been modified). I also need the match to ignore case:
.html, .htm, .shtml, .shtm, .xhtml
.html*, .htm*, .shtml*, .shtm*, .xhtml*
Also filtering some CSS files:
.css
.css*
And some SQL Files:
.sql, .ddl, .dml
.sql*, .ddl*, .dml*
My previous question got an answer to filtering Python files:
.py, .py3, .pyi, .pyx, .pyw
Expression would be: \.py[3ixw]?\*?$
But when I tried to learn from the expression above, I always ended up matching .xhtml only; the rest were not matched.
For the HTML expression, I currently have this: \.html|.html|.shtml|.shtm|.xhtml\*?$ with RegexOptions.IgnoreCase. But it only accepts .xhtml, whether case sensitive or insensitive; .html files, .htm and the rest do not match. I would really appreciate an explanation of each expression you provide (so I don't have to ask the same question ever again).
Thank you.

For such cases you can start with a simple, verbose regex and simplify it step by step into a compact expression.
In C#, with IgnoreCase, this would basically be
Regex myRegex = new Regex("PATTERN", RegexOptions.IgnoreCase);
Now the pattern: the easiest one simply joins all valid results with OR, escaping the dots:
\.html|\.htm|\.shtml|\.shtm|\.xhtml|\.html*|\.htm*|\.shtml*|\.shtm*|\.xhtml*
With .html* you mean .html + anything, which is written as .* (any character, zero or more times) in regex:
\.html|\.htm|\.shtml|\.shtm|\.xhtml|\.html.*|\.htm.*|\.shtml.*|\.shtm.*|\.xhtml.*
Then you can take all repeating patterns and group them together. Every ending starts with a dot, and each "ending.*" alternative already covers the plain ending, so the list collapses to:
\.(html|htm|shtml|shtm|xhtml).*
Then, I see htm pretty often, so I try to factor that out, putting all the possible characters before and after htm around it (? means zero or one occurrence):
\.(s|x)?(htm)l?.*
I always check that it is still working on regexstorm for .NET.
In the same way you can build regular expressions for the other two groups and join them all together with | at the end.
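If the star in the question is read as a literal "modified" marker at the end of the tab title rather than "anything", a slightly different set of patterns results. The sketch below is one possible reading, not the asker's or answerer's code; the class and method names are invented. Each pattern is: an escaped dot, the extension alternatives in a non-capturing group, an optional escaped star (\*?), and $ so that the extension has to sit at the end. Grouping the alternatives matters because | has the lowest precedence, so without the group the trailing \*?$ would apply only to the last alternative.

```csharp
using System.Text.RegularExpressions;

static class TabTitleFilter
{
    // \.(?:s|x)?html?  -> .htm, .html, .shtm, .shtml, .xhtml (and also .xhtm)
    // \*?$             -> optional literal star, then end of the title
    static readonly Regex Html =
        new Regex(@"\.(?:s|x)?html?\*?$", RegexOptions.IgnoreCase);

    static readonly Regex Css =
        new Regex(@"\.css\*?$", RegexOptions.IgnoreCase);

    static readonly Regex Sql =
        new Regex(@"\.(?:sql|ddl|dml)\*?$", RegexOptions.IgnoreCase);

    public static bool IsHtml(string title) => Html.IsMatch(title);
    public static bool IsCss(string title) => Css.IsMatch(title);
    public static bool IsSql(string title) => Sql.IsMatch(title);
}
```

TabTitleFilter.IsHtml("page.SHTM*") returns true, while "index.html.bak" and "notes.txt" are rejected because the extension is not at the end of the title.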

CamelCase conversion to friendly name, i.e. Enum constants; Problems?

In my answer to this question, I mentioned that we used UpperCamelCase parsing to get a description of an enum constant not decorated with a Description attribute, but it was naive, and it didn't work in all cases. I revisited it, and this is what I came up with:
var result = Regex.Replace(camelCasedString,
    @"(?<a>(?<!^)[A-Z][a-z])", @" ${a}");
result = Regex.Replace(result,
    @"(?<a>[a-z])(?<b>[A-Z0-9])", @"${a} ${b}");
The first Replace looks for an uppercase letter, followed by a lowercase letter, EXCEPT where the uppercase letter is the start of the string (to avoid having to go back and trim), and adds a preceding space. It handles your basic UpperCamelCase identifiers, and leading all-upper acronyms like FDICInsured.
The second Replace looks for a lowercase letter followed by an uppercase letter or a number, and inserts a space between the two. This is to handle special but common cases of middle or trailing acronyms, or numbers in an identifier (except leading numbers, which are usually prohibited in C-style languages anyway).
Running some basic unit tests, the combination of these two correctly separated all of the following identifiers: NoDescription, HasLotsOfWords, AAANoDescription, ThisHasTheAcronymABCInTheMiddle, MyTrailingAcronymID, TheNumber3, IDo3Things, IAmAValueWithSingleLetterWords, and Basic (which didn't have any spaces added).
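For reference, the two passes can be wrapped in a single helper, roughly as follows. The patterns are the ones from the question; the class and method names are made up for the sketch.

```csharp
using System.Text.RegularExpressions;

static class FriendlyName
{
    public static string FromCamelCase(string identifier)
    {
        // Pass 1: put a space before an uppercase letter that is followed by
        // a lowercase letter, unless it is the first character.
        string result = Regex.Replace(
            identifier, @"(?<a>(?<!^)[A-Z][a-z])", " ${a}");

        // Pass 2: put a space between a lowercase letter and a following
        // uppercase letter or digit (middle/trailing acronyms, numbers).
        return Regex.Replace(
            result, @"(?<a>[a-z])(?<b>[A-Z0-9])", "${a} ${b}");
    }
}
```

For example, FriendlyName.FromCamelCase("MyTrailingAcronymID") gives "My Trailing Acronym ID", and "Basic" is returned unchanged.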
So, I'm posting this first to share it with others who may find it useful, and second to ask two questions:
Anyone see a case that would follow common CamelCase-ish conventions, that WOULDN'T be correctly separated into a friendly string this way? I know it won't separate adjacent acronyms (FDICFCUAInsured), recapitalize "properly" camelCased acronyms like FdicInsured, or capitalize the first letter of a lowerCamelCased identifier (but that one's easy to add - result = Regex.Replace(result, "^[a-z]", m=>m.ToString().ToUpper());). Anything else?
Can anyone see a way to make this one statement, or more elegant? I was looking to combine the Replace calls, but as they do two different things to their matches it can't be done with these two strings. They could be combined into a method chain with a RegexReplace extension method on String, but can anyone think of better?

So while I agree with Hans Passant here, I have to say that I had to try my hand at making it one regex as an armchair regex user.
(?<a>(?<!^)((?:[A-Z][a-z])|(?:(?<!^[A-Z]+)[A-Z0-9]+(?:(?=[A-Z][a-z])|$))|(?:[0-9]+)))
Is what I came up with. It seems to pass all the tests you put forward in the question.
So
var result = Regex.Replace(camelCasedString, @"(?<a>(?<!^)((?:[A-Z][a-z])|(?:(?<!^[A-Z]+)[A-Z0-9]+(?:(?=[A-Z][a-z])|$))|(?:[0-9]+)))", @" ${a}");
Does it in one pass.

Not that this directly answers the question, but why not test by taking the standard C# API and converting each class name into a friendly name? It'd take some manual verification, but it'd give you a good list of standard names to test.

Let's say every case you come across works with this (you're asking us for examples that won't and then giving us some, so you don't even have a question left).
This still binds UI to programmatic identifiers in a way that will make both programming and UI changes brittle.
It still assumes your program will only be used in one language. Either your potential market is so small that just indexing an array of names would be scalable enough (e.g. a one-client bespoke or in-house project), or you are assuming you will never be successful enough to need to be available in other languages or other dialects of your first-chosen language.
Does "well, it'll work as long as we're a failure" sound like a passing grade in balancing designs?
Either code it to use resources, or else code it to pass the enum name blindly or use an array of names, as that at least will be modifiable afterwards.

Culture specific characters to nice URL format

I need some functionality to make the following string in a url-friendly format:
"knæ som gør" should be "kna-som-gor"
That is, replacing culture-specific characters with characters that can be used in URLs.
Using .NET and C#.
Please help me :)
/Andreas

Don't complicate things. :)
Either use a regexp, or simply use String.Replace.

You can find a solution that removes diacritics here: How do I remove diacritics (accents) from a string in .NET?. This solution does not help you with æ or ø, though.
Maybe that removes enough of your special characters that the rest can be translated using simple replacing?
If "url-friendly" does not mean pretty, you could also use HttpUtility.UrlEncode, which produces
"kn%c3%a6+som+g%c3%b8r".

Edit: Added possible solution (end of post).
I had a very similar problem, albeit for file names rather than URLs. The main problem seems to be that there is no standard way to ask for the "best ASCII replacement for ø", so even if you can locate all the unwanted characters it is hard to automate which replacement to insert.
I posted quite a bit of code that might be helpful. See this StackOverflow question for details.
Edit: I think the solution to this problem lies with StringInfo, which allows you to iterate through the text elements of a string (a base character together with any combining characters or surrogate pairs). This should make it possible to detect and convert something like å, which can be encoded in Unicode either as a single precomposed character or as the letter a followed by a combining ring: filter out the combining mark and keep the part that is a normal character.
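Putting the suggestions from these answers together (strip diacritics, then replace what is left by hand), a rough sketch might look like the following. The replacement table is an assumption for illustration and only covers æ and ø, mapped the way the question's example expects; a real one would need an entry for every character that normalization does not decompose. The class and method names are hypothetical.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text;

static class Slug
{
    // Hypothetical table for characters that FormD does not decompose.
    static readonly Dictionary<char, string> Special = new Dictionary<char, string>
    {
        ['\u00E6'] = "a", // æ, mapped as in the question's example
        ['\u00F8'] = "o"  // ø
    };

    public static string ToSlug(string input)
    {
        // FormD splits accented letters into base letter + combining marks.
        string normalized = input.ToLowerInvariant().Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder();

        foreach (char c in normalized)
        {
            if (Special.TryGetValue(c, out string replacement))
                sb.Append(replacement);
            else if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue; // drop the accent, keep the base letter already appended
            else if ((c >= 'a' && c <= 'z') || (c >= '0' && c <= '9'))
                sb.Append(c);
            else if (sb.Length > 0 && sb[sb.Length - 1] != '-')
                sb.Append('-'); // collapse any other run of characters into one hyphen
        }

        return sb.ToString().TrimEnd('-');
    }
}
```

Calling Slug.ToSlug("knæ som gør") is intended to yield the "kna-som-gor" form asked for in the question.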

Verifying that an uploaded file contains only plain text

I have an ASP.NET MVC application that allows the user to upload a file that should only contain plain text.
I am looking for a simple approach to validate that the file does indeed contain only text.
For my purposes I am happy to define text as any of the characters that I can see printed on my GB QWERTY keyboard.
Business rules mean that my uploaded file won't contain any accented characters, so it doesn't matter if the code accepts or rejects these.
Approaches so far that have not worked:
Checking the content-type; no good as this is dependent on the file extension
Checking char.IsControl for each character; no good as the file can contain pipe (|) characters which are considered to be control characters
I'd rather avoid using a lengthy Regex pattern to get this to work.

It sounds like you want ASCII characters 32-126 plus a few odds and ends like 9 (horizontal tab), carriage return and linefeed.
"I'd rather avoid using a lengthy Regex pattern to get this to work."
As long as that doesn't mean 'no regular expressions at all', you can use a pattern based on the accepted answer to this Stack Overflow question (I've added the horizontal tab character to the original). The character class must not be negated here, otherwise the anchored pattern would accept only text made up entirely of disallowed characters:
^[\x09\x0d\x0a\x20-\x7e]*$
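If a regex is to be avoided altogether, the same rule can be written as a plain loop. This is a minimal sketch that assumes "text" means ASCII 32-126 plus tab, carriage return and linefeed, as described in this answer; the class and method names are invented. A character such as £, which is on a GB keyboard but outside ASCII, would have to be added to the check explicitly.

```csharp
static class PlainTextCheck
{
    // True when every character is printable ASCII (32-126), tab, CR or LF.
    public static bool IsPlainText(string content)
    {
        foreach (char c in content)
        {
            bool printable = c >= ' ' && c <= '~';
            bool allowedWhitespace = c == '\t' || c == '\r' || c == '\n';
            if (!printable && !allowedWhitespace)
                return false;
        }
        return true;
    }
}
```

The uploaded file's contents would first be read into a string (for example with a StreamReader over the uploaded stream) and then passed to PlainTextCheck.IsPlainText.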

Regular expression to define format of backup filenames

In the application I am currently working on, I have an option to create automatic backups of a certain file on the hard disk. What I would like to do is offer the user the possibility to configure the name of the file and its extension.
For example, the backup filename could be something like "backup_month_year_username.bak". I had the idea of saving the format in the form of a regular expression. For the example above, the regexp would look like:
"^backup_(?<Month>\d{2})_(?<Year>\d{2})_(?<Username>\w).(?<extension>bak)$"
I thought about using regex because I will also have to browse through the directory of backed-up files to delete those older than a certain date. The main trouble I have now is how to create a filename using the regex. In a way I should replace the tags with the information. I could do that using Regex.Replace and another regex, but I feel it's a bit weird doing that and there might be a better way.
Thanks
[Edit] Maybe I wasn't really clear in the first go, but the idea is of course that the user (in this case an admin that will know regex syntax) will have the possibility to modify the form of the filename, that's all the idea behind it[/Edit]

... and if the regex changes, it is next to impossible to reconstruct a string from a given regex.
Edit:
Create some predefined "place-holders": %u could be the user's name, %y could be the year, etc.:
backup_%m_%y_%u.bak
and then simply replace each %? with its actual value.
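A minimal sketch of that placeholder idea in C#, assuming %m is a two-digit month, %y a two-digit year and %u the user name. The class and method names are made up, and a real implementation would probably want to validate the template and the user name as well.

```csharp
using System;
using System.Globalization;

static class BackupName
{
    // Expands %m, %y and %u in a user-supplied template.
    public static string Format(string template, DateTime date, string userName)
    {
        return template
            .Replace("%m", date.ToString("MM", CultureInfo.InvariantCulture))
            .Replace("%y", date.ToString("yy", CultureInfo.InvariantCulture))
            .Replace("%u", userName); // last, so the user name itself is never expanded
    }
}
```

For example, BackupName.Format("backup_%m_%y_%u.bak", new DateTime(2009, 3, 5), "andreas") produces "backup_03_09_andreas.bak".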

It sounds like you're trying to use the regular expression to create the file name from a pattern which the user should be able to specify.
Regular expressions can - AFAIK - not be used to create output, but only to validate input, so you'd have the user specify two things:
a file name production pattern like Bart suggested
a validation pattern in form of a regular expression that helps you split the file names into their parts
EDIT
By the way, your sample regex contains an error: the "." is used for "any character" and has to be escaped to match a literal dot. Also, \w only matches one word character, so I guess you meant to write
"^backup_(?<Month>\d{2})_(?<Year>\d{2})_(?<Username>\w+)\.(?<extension>bak)$"

If the filename is always in this form, there is no reason for a regex, as it's easier to process with string.Split ...

With Bart's solution it is easy enough to split (using string.Split) the generated file name using underscore as the delimiter, to get back the information.
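As a rough sketch of the string.Split approach, assuming the fixed backup_MM_YY_user.bak layout from the question: the split is limited to four parts so that an underscore inside the user name does not break it. The class and method names are hypothetical.

```csharp
using System.IO;

static class BackupNameSplitter
{
    // Splits "backup_MM_YY_user.bak" into month, year and user name.
    public static bool TrySplit(string fileName,
        out string month, out string year, out string userName)
    {
        month = year = userName = string.Empty;

        string stem = Path.GetFileNameWithoutExtension(fileName);

        // At most 4 parts: "backup", month, year, and the rest as the user name.
        string[] parts = stem.Split(new[] { '_' }, 4);
        if (parts.Length != 4 || parts[0] != "backup")
            return false;

        month = parts[1];
        year = parts[2];
        userName = parts[3];
        return true;
    }
}
```

This only works while the template keeps that exact order of fields; once the user can rearrange the placeholders, a generated regex or a stored field order is needed instead.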

Ok, I think I have found a way to use only the regex. As I am using groups to get the information, I will use another regular expression to match the regular expression and replace the groups with the value:
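// Verbatim strings (@"...") are needed so the backslashes reach the regex engine.
Regex rgx = new Regex(@"\(\?<Month>.+?\)");
string fileName = rgx.Replace(
    @"^backup_(?<Month>\d{2})_(?<Year>\d{2})_(?<Username>\w+)\.(?<extension>bak)$",
    DateTime.Now.Month.ToString("00")); // two digits, to agree with \d{2}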
Ok, it's really a hack, but at least it works and I have only one pattern defined by the user. It might not work if the regex is too complex, but I think I can deal with that problem.
What do you think?
