C# Regex filter problems


I asked a similar Regex question earlier. Regex has given me headaches: I have read a lot of documentation on how to use it, but I still cannot put my finger on it, and I would rather not spend another 6 hours trying to filter what are (I think) simple expressions.
What I want to do is match all file names ending in HTML extensions, case-insensitively (IgnoreCase). The '*' stars come from a WinForms TabControl, where a trailing star signifies that the file has been modified:
.html, .htm, .shtml, .shtm, .xhtml
.html*, .htm*, .shtml*, .shtm*, .xhtml*
Also filtering some CSS files:
.css
.css*
And some SQL Files:
.sql, .ddl, .dml
.sql*, .ddl*, .dml*
My previous question got an answer to filtering Python files:
.py, .py3, .pyi, .pyx, .pyw
Expression would be: \.py[3ixw]?\*?$
But when I tried to adapt the expression above, I always ended up matching .xhtml only; the other extensions were rejected.
For the HTML expression, I currently have this: \.html|.html|.shtml|.shtm|.xhtml\*?$ with RegexOptions.IgnoreCase. But it only accepts .xhtml, with or without case sensitivity; .html, .htm and the rest do not match. I would really appreciate an explanation of each expression you provide (so I don't have to ask the same question ever again).
Thank you.

For such cases you can start with a simple, verbose regex and simplify it step by step into a compact one.
In C#, with IgnoreCase, the setup is basically:
Regex myRegex = new Regex("PATTERN", RegexOptions.IgnoreCase);
Now the pattern. The easiest version simply joins all valid endings with OR (|), escaping the dots:
\.html|\.htm|\.shtml|\.shtm|\.xhtml|\.html*|\.htm*|\.shtml*|\.shtm*|\.xhtml*
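Note that a bare * is a quantifier in a regex, so \.html* actually means .htm followed by zero or more l characters. If .html* is meant as .html followed by anything, that is written .* (any character, zero or more times); if the star is a literal character, as with the modified-tab marker in the question, it has to be escaped as \* instead. Using .*: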
\.html|\.htm|\.shtml|\.shtm|\.xhtml|\.html.*|\.htm.*|\.shtml.*|\.shtm.*|\.xhtml.*
Then group the repeated parts. Every ending starts with a dot, and ending.* also matches the bare ending (because .* can match nothing), so the alternatives without .* are redundant:
\.(html|htm|shtml|shtm|xhtml).*
Next, htm appears in every alternative, so it can be factored out. Collecting the optional characters before and after htm (? means zero or one occurrence):
\.(s|x)?(htm)l?.*
I always check that the result still works, for example in Regex Storm for .NET.
In the same way you can derive regular expressions for the other two groups and join them all together at the end.
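As a hedged sketch of where those steps can end up if the star is treated as the literal modified-tab marker from the question: the class and field names below are made up for illustration, and the patterns are anchored with $ so that only the end of the name is tested. Note that [sx]?html? also accepts .xhtm, which is not in the question's list.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ExtensionFilters
{
    // Dot, optional s or x, "htm", optional l, optional literal '*', end of string.
    public static readonly Regex Html = new Regex(@"\.[sx]?html?\*?$", RegexOptions.IgnoreCase);

    // Dot, "css", optional literal '*', end of string.
    public static readonly Regex Css = new Regex(@"\.css\*?$", RegexOptions.IgnoreCase);

    // Dot, one of sql/ddl/dml, optional literal '*', end of string.
    public static readonly Regex Sql = new Regex(@"\.(?:sql|ddl|dml)\*?$", RegexOptions.IgnoreCase);

    public static void Main()
    {
        // Sample tab titles (hypothetical).
        foreach (var name in new[] { "index.html", "PAGE.SHTM*", "style.css*", "schema.ddl", "notes.txt" })
        {
            Console.WriteLine($"{name}: html={Html.IsMatch(name)} css={Css.IsMatch(name)} sql={Sql.IsMatch(name)}");
        }
    }
}
```

Without the $ anchor, a name such as index.html.bak would also count as HTML, because the pattern could match in the middle of the string.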


Too short control escape. How to get Regex for this?

So, let's say I have a result from a search that comes back as:
\\my.test.site#SSL\JohnDoe\SusanSmith\courses\PDFs\Science_Math\BIOL\S12014 Syllabi\BIOL-1322-S12014-John-Doe.pdf
Whenever the result is listed in a text box I get the entire path instead of just the file name. This is working as designed: I can't use .Select(Path.GetFileName) while enumerating directories, because then the search would no longer have the full path to work with.
So I was going to use Regex to do a replace at the end, when the results are displayed. However, when I went to Rubular, it didn't like either my expression or the test string (I can't figure out which).
I basically want to strip everything except the file name and extension.
So my Regex was supposed to be something like:
\\my.test.site#SSL\JohnDoe\SusanSmith\courses\PDFs\.+\.+\.+\
The idea is to match everything up to the file name and extension so that it can be deleted. However, Rubular rejects it with a "too short control escape" error. I don't want to try this in C# without verifying it in Rubular first, since I use Rubular heavily and figure that if it won't work there it won't work at runtime.
Any ideas? Thanks.

Remember to escape the \ characters, as well as the literal . characters:
\\\\my\.test\.site#SSL\\JohnDoe\\SusanSmith\\courses\\PDFs\\.+\\.+\\.+\\
Also note, you probably want to avoid over-matching on the .+ by using non-greedy quantifiers:
\\\\my\.test\.site#SSL\\JohnDoe\\SusanSmith\\courses\\PDFs\\.+?\\.+?\\.+?\\
Or using character classes:
\\\\my\.test\.site#SSL\\JohnDoe\\SusanSmith\\courses\\PDFs\\[^\\]+\\[^\\]+\\[^\\]+\\
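For what it's worth, here is a minimal sketch of the character-class version used from C#. The class and method names are invented for the example, and the leading ^ is my addition so the prefix can only match at the start of the string. Verbatim strings (@"...") keep every backslash exactly as the regex engine should see it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PrefixStripper
{
    // Fixed share prefix, then exactly three folder levels, each written as [^\\]+\\.
    private static readonly Regex Prefix = new Regex(
        @"^\\\\my\.test\.site#SSL\\JohnDoe\\SusanSmith\\courses\\PDFs\\[^\\]+\\[^\\]+\\[^\\]+\\");

    // Removes the matched prefix; the input is returned unchanged if it does not match.
    public static string Strip(string fullPath) => Prefix.Replace(fullPath, "");

    public static void Main()
    {
        string path = @"\\my.test.site#SSL\JohnDoe\SusanSmith\courses\PDFs\Science_Math\BIOL\S12014 Syllabi\BIOL-1322-S12014-John-Doe.pdf";
        Console.WriteLine(Strip(path)); // BIOL-1322-S12014-John-Doe.pdf
    }
}
```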

Maybe I'm misinterpreting the question, but it sounds like your approach has been overly complicated.
Can't you simply match this: .+\\
And then replace with '' (nothing)?
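A short sketch of that idea, next to a plain string alternative that needs no regex at all (the class and method names are hypothetical). The greedy .+ runs to the last backslash, so only the file name is left.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FileNameOnly
{
    // Greedy ".+\\" consumes everything up to and including the last backslash.
    public static string ViaRegex(string fullPath) => Regex.Replace(fullPath, @".+\\", "");

    // Same result without a regex: keep what follows the last backslash.
    public static string ViaLastIndexOf(string fullPath) =>
        fullPath.Substring(fullPath.LastIndexOf('\\') + 1);

    public static void Main()
    {
        string path = @"\\my.test.site#SSL\JohnDoe\SusanSmith\courses\PDFs\Science_Math\BIOL\S12014 Syllabi\BIOL-1322-S12014-John-Doe.pdf";
        Console.WriteLine(ViaRegex(path));
        Console.WriteLine(ViaLastIndexOf(path));
    }
}
```

Both leave a string without any backslash untouched.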

Regular expression for valid filename

I have already gone through some questions on StackOverflow regarding this, but nothing helped much in my case.
I want to restrict the user to filenames that contain only alphanumeric characters, -, _, . and space.
I'm not good at regular expressions, and so far I have come up with this: ^[a-zA-Z0-9.-_]$. Can somebody help me?

This is the correct expression:
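string regex = @"^[\w\-. ]+$";
\w is equivalent to [0-9a-zA-Z_] for ASCII input; in .NET it also matches Unicode letters and digits unless RegexOptions.ECMAScript is specified.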

To validate a file name I would suggest using the function provided by the .NET framework rather than a regex:
if (filename.IndexOfAny(System.IO.Path.GetInvalidFileNameChars()) != -1)
{
    // filename contains at least one character that is not allowed in a file name
}
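Wrapped into a helper, that check could look roughly like this. The class and method names are my own, and the extra empty-string test is an assumption about what a caller would want. Keep in mind that Path.GetInvalidFileNameChars() returns a platform-specific set, so the same name can pass on one OS and fail on another.

```csharp
using System;
using System.IO;

public static class FileNameCheck
{
    // True when the name is non-empty and contains none of the characters
    // that the current platform forbids in file names.
    public static bool IsValid(string fileName) =>
        !string.IsNullOrWhiteSpace(fileName) &&
        fileName.IndexOfAny(Path.GetInvalidFileNameChars()) == -1;

    public static void Main()
    {
        Console.WriteLine(IsValid("report.txt"));
        Console.WriteLine(IsValid("a/b.txt"));
    }
}
```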

While what the OP asks for is close to what the currently accepted answer uses (^[\w\-. ]+$), others who find this question may have even more specific constraints.
First off, \w in .NET is Unicode-aware, so it allows a wide range of letters from other alphabets that fall outside the limits the OP stated.
Secondly, if the file extension is included in the name, this allows all sorts of weird looking, though valid, filenames like file .txt or file...txt.
Thirdly, if you're simply uploading the files to your file system, you might want a blacklist of files and/or extensions like these:
web.config, hosts, .gitignore, httpd.conf, .htaccess
However, that is considerably out of scope for this question; it would require all sorts of info about the setup for good guidance on security issues. I thought I should raise the matter none the less.
So for a solution where the user can input the full file name, I would go with something like this:
^[a-zA-Z0-9](?:[a-zA-Z0-9 ._-]*[a-zA-Z0-9])?\.[a-zA-Z0-9_-]+$
It ensures that only the English alphabet is used, that there are no leading or trailing spaces, and that a file extension of at least one character, without whitespace, is present.
I've tested this on Regex101, but for future reference, this was my "test-suite":
## THE BELOW SHOULD MATCH
web.config
httpd.conf
test.txt
1.1
my long file name.txt
## THE BELOW SHOULD NOT MATCH - THOUGH VALID
æøå.txt
hosts
.gitignore
.htaccess
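The same test suite can be run from C# with a few lines. This is only a sketch; the class name is invented, and the non-ASCII sample is written with \u escapes so the source file encoding does not matter.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StrictFileName
{
    public static readonly Regex Pattern = new Regex(
        @"^[a-zA-Z0-9](?:[a-zA-Z0-9 ._-]*[a-zA-Z0-9])?\.[a-zA-Z0-9_-]+$");

    public static void Main()
    {
        string[] shouldMatch = { "web.config", "httpd.conf", "test.txt", "1.1", "my long file name.txt" };
        string[] shouldNotMatch = { "\u00e6\u00f8\u00e5.txt", "hosts", ".gitignore", ".htaccess" };

        foreach (var name in shouldMatch)
            Console.WriteLine($"{name}: {Pattern.IsMatch(name)} (expected True)");
        foreach (var name in shouldNotMatch)
            Console.WriteLine($"{name}: {Pattern.IsMatch(name)} (expected False)");
    }
}
```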

In case someone else needs to validate filenames (including Windows reserved words and such), here's a full expression:
\A(?!(?:COM[0-9]|CON|LPT[0-9]|NUL|PRN|AUX|com[0-9]|con|lpt[0-9]|nul|prn|aux)|[\s\.])[^\\\/:*"?<>|]{1,254}\z
Extended expression (don't allow filenames starting with 2 dots, don't allow filenames ending in dots or whitespace):
\A(?!(?:COM[0-9]|CON|LPT[0-9]|NUL|PRN|AUX|com[0-9]|con|lpt[0-9]|nul|prn|aux)|\s|[\.]{2,})[^\\\/:*"?<>|]{1,254}(?<![\s\.])\z
Edit:
For the interested, here's a link to Windows file naming conventions:
https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247(v=vs.85).aspx
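Here is a small sketch of how the first expression behaves when dropped into C# (the class name is hypothetical; inside a verbatim string the double quote has to be doubled). Two things I noticed while trying it, so treat them as observations rather than design goals: the lookahead rejects any name that merely starts with a reserved word, such as CONTACT.txt, and because only the all-upper and all-lower spellings are listed, a mixed-case name such as Con.txt is accepted.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WindowsFileName
{
    // The first expression from above, unchanged apart from C# string quoting.
    public static readonly Regex Pattern = new Regex(
        @"\A(?!(?:COM[0-9]|CON|LPT[0-9]|NUL|PRN|AUX|com[0-9]|con|lpt[0-9]|nul|prn|aux)|[\s\.])[^\\\/:*""?<>|]{1,254}\z");

    public static void Main()
    {
        foreach (var name in new[] { "report.txt", "NUL", "nul.txt", "a:b.txt", ".hidden", "CONTACT.txt", "Con.txt" })
            Console.WriteLine($"{name}: {Pattern.IsMatch(name)}");
    }
}
```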

Use this regular expression: ^[a-zA-Z0-9._ -]+$

This is a minor change to Engineer's answer.
string regex = @"^[\w\- ]+[\w\-. ]*$";
This will block ".txt", which isn't valid.
Trouble is, it also blocks "..txt", which is valid.

For full character set (Unicode) use
^[\p{L}0-9_\-.~]+$
or perhaps
^[\p{L}\p{N}_\-.~]+$
would be more accurate if we are talking about Unicode.
I added a '~' simply because I have some files using that character.
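A quick sketch of the second variant in C#, with an invented class name; the sample names are mine and the non-ASCII one is written with \u escapes. Note that a space is not in this character class, so names with spaces are rejected.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnicodeFileName
{
    // Any Unicode letter or number, plus underscore, hyphen, dot and tilde.
    public static readonly Regex Pattern = new Regex(@"^[\p{L}\p{N}_\-.~]+$");

    public static void Main()
    {
        foreach (var name in new[] { "\u00e6\u00f8\u00e5.txt", "file~1.txt", "my file.txt" })
            Console.WriteLine($"{name}: {Pattern.IsMatch(name)}");
    }
}
```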

I've just created this. It rejects consecutive dots and a dot at the beginning or end. Note that it allows exactly one dot in total, so a name such as archive.tar.gz is rejected as well.
^([a-zA-Z0-9_]+)\.(?!\.)([a-zA-Z0-9]{1,5})(?<!\.)$

When used in HTML5 via pattern:
<form action="" method="POST">
<fieldset>
<legend>Export Configuration</legend>
<label for="file-name">File Name</label>
<input type="text" required pattern="^[\w\-. ]+$" id="file-name" name="file_name"/>
</fieldset>
<button type="submit">Export Settings</button>
</form>
This validates the input against the pattern in the browser. Removing required allows an empty value to be submitted; to switch off native HTML5 validation altogether, add novalidate to the form.

I may be saying something stupid here, but it seems to me that these answers aren't correct. Firstly, are we talking Linux or Windows here (or another OS)?
Secondly, in Windows it is (I believe) perfectly legitimate to include a "$" in a filename, not to mention Unicode in general. It certainly seems possible.
I tried to find a definitive source on this and ended up at the Wikipedia Filename page. In particular, the section "Reserved characters and words" seems relevant: it is, clearly, a list of things which you are NOT allowed to put in.
I'm in the Java world, and I naturally assumed that Apache Commons would have something like validateFilename, maybe in FilenameUtils, but it appears not. (If it had, this would still be potentially useful to C# programmers, as the code is usually pretty easy to understand and could therefore be translated.) I did do an experiment, though, using the method normalize: to my disappointment, it allowed perfectly invalid characters (?, etc.) to "pass".
The part of the Wikipedia Filename page referenced above shows that this question depends on the OS you're using, but it should be possible to concoct some simple regex for Linux and Windows at least.
Then I found a Java way (at least):
Path path = java.nio.file.FileSystems.getDefault().getPath("bobb??::mouse.blip");
output:
java.nio.file.InvalidPathException: Illegal char at index 4:
bobb??::mouse.blip
... presumably different FileSystem objects will have different validation rules

Copied from @Engineer for future reference, with the dot escaped, which the most voted answer does not do. (Inside a character class an unescaped dot is already literal, so both forms match the same names.)
This is the correct expression:
string regex = @"^[\w\-\. ]+$";

C# regex html table inside a table

I am using the following regex:
(<(table|h[1-6])[^>]*>(?<op>.+?)<\/(table|h[1-6])>)
to extract tables (and headings) from a html document.
I've found it to work quite well on the documents we are using (documents converted with Word's "Save as filtered HTML"). However, if a table contains another table, the regex matches from the outer table's start tag to the inner table's end tag rather than to the outer table's end tag.
Is there a way in regex to specify that if it finds another table tag within the match, it should ignore the next match of </table> and go for the next one, and so on?

Don't do this.
HTML is not a regular grammar, so a regular expression is not a good tool with which to parse it. What you are asking for in your last sentence is a contextual parser, not a regular expression. Bare regular-expression parsing is too likely to fail on real HTML to be responsible coding.
HtmlAgilityPack is an MS-PL-licensed solution I've used in the past. It has widely acceptable license terms and provides a well-formed DOM which can be probed with XPath or manipulated in other useful ways. ("Extract all text, dropping out tags" is a popular one, for importing HTML mail for search, for example; it is nigh trivial once you let a DOM parser rip through the HTML and only code the part that adds value for your specific business case.)

"Is there a way in regex to specify that if it finds another table tag within the match, it should ignore the next match of </table> and go for the next one, and so on?"
Since nobody's actually answered this part, I will: no.
This is part of what makes regular languages "regular". A regular language is one that can be recognized by a certain regular grammar, often described in syntax that looks very much like basic regular expressions (10* to match 1 followed by any number of 0s), or a DFA. "Regular Expressions" are based strongly off of these regular languages, as their name implies, but add some functions such as lookaheads and lookbehinds. As a general rule, a regular language knows nothing about what's around it or what it's seen, only what it's looking at currently, and which of its finite states it's in.
TL;DR: Why does this matter to you? Since a regular language cannot "count" elements in that way, it is impossible to keep a tally of the number of <table> and </table> tags you have seen. An HTML parser does just that: since it is not trying to emulate a regular language, it can count the number of opening and closing tags it sees.
This is the prime example of why it's best not to use regular expressions to parse HTML; even though you know how it may be formed, you cannot parse it since there may be nested elements. If you could guarantee there would be no nested tables, it may be feasible to do this, but even then, using a parser would be much simpler.
Plea to the theoretical computer scientists: I did my best to explain what I know from the CS Theory classes I've taken in a way that most people here should be able to understand. I know that regular languages can "count" finite numbers of things. Feel free to correct me, but please be kind!
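To make the "counting" point concrete, here is a rough sketch, not a real HTML parser: a regex is used only to find individual table tags, and ordinary code keeps the depth counter that the regex itself cannot keep. The class and method names are invented, and it assumes well-formed markup with no table tags hidden in comments or scripts, so a proper parser is still the safer choice.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TopLevelTables
{
    // Matches a single opening <table ...> or closing </table> tag; group 1 is "/" for a closing tag.
    private static readonly Regex TableTag = new Regex(@"<(/?)table\b[^>]*>", RegexOptions.IgnoreCase);

    // Returns each outermost table, including any tables nested inside it.
    public static List<string> Extract(string html)
    {
        var result = new List<string>();
        int depth = 0;   // number of currently open tables
        int start = -1;  // index where the current outermost table begins

        foreach (Match tag in TableTag.Matches(html))
        {
            bool closing = tag.Groups[1].Value == "/";
            if (!closing)
            {
                if (depth == 0) start = tag.Index;
                depth++;
            }
            else if (depth > 0)
            {
                depth--;
                if (depth == 0)
                    result.Add(html.Substring(start, tag.Index + tag.Length - start));
            }
        }
        return result;
    }

    public static void Main()
    {
        string html = "<h1>t</h1><table><tr><td><table><tr><td>in</td></tr></table></td></tr></table>";
        foreach (var table in Extract(html))
            Console.WriteLine(table);
    }
}
```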

Regular expressions are not really suited for this as what you're trying to do contains knowledge about the fact that this is a nested language. Without this knowledge it will be really hard (and also hard to read and maintain) to extract this information.
Maybe do something with an XPath navigator?

Is it possible to use Regex to extract text from attributes repeated in a text file - c# .NET

I am working on something at the moment and need to extract an attribute from a big list of tags. They are formatted like this:
<appid="928" appname="extractapp" supportemail="me@mydomain.com" /><appid="928" appname="extractapp" supportemail="me@mydomain.com" />
The tags are repeated one after another and all have different appid, appname, supportemail.
I need to just extract all of the support emails, just the email address, without the supportemail=
Will I need to use two regex statements, one to separate each individual tag, and then loop through the result and pull out the emails?
I would then go through and Add the emails to a list, then loop through the list and write each one to a txt file, with a comma after it.
I've never really used Regex too much, so don't know if it's suitable for the above?
I would spend more time trying it myself but it's quite urgent. So hopefully somebody can help.

Have you considered Linq to XML?
http://www.hookedonlinq.com/LINQtoXML5MinuteOverview.ashx

Using XML is better, perhaps, but here's the regular expression you'd use (in case there's a particular reason you need/want to use regular expressions to read XML):
(appid="(?<AppID>[^"]+)" appname="(?<AppName>[^"]+)" supportemail="(?<SupportEmail>[^"]+)")
You can just take the last bit there for the support email but this will extract all of the attributes you mentioned and they will be "grouped" within each tag.

What about modifying the string into proper XML and loading it as XML to extract all the values of the supportemail attribute?

Use:
string pattern = "supportemail=\"([^\"]+)";
MatchCollection matches = Regex.Matches(inputString, pattern);
foreach (Match m in matches)
    Console.WriteLine(m.Groups[1].Value);

Problems you'll encounter by using regular expressions instead of an XML DOM:
All of the example regexes posted thus far will fail in the extremely common case that the attribute values are delimited by single quotes.
Any regex that depends on the attributes appearing in a specific order (e.g. appId before appName) will fail in the event that attributes - whose ordering is insignificant to XML - appear in an order different from what the regex expects.
A DOM will resolve entity references for you and a regex will not; if you use regex, you must check the returned values for (at least) the XML character entities &amp;amp;, &amp;apos;, &amp;gt;, &amp;lt;, and &amp;quot;.
There's a well-known edge case where using regular expressions to parse XML and XHTML unleashes the Great Old Ones. This will complicate your task considerably, as you will be reduced to gibbering madness and then the Earth will be eaten.
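If a regex is used anyway, the first and third points can be softened a little. The sketch below is only an illustration (class and method names are invented, sample data is made up): it accepts either quote style via a backreference and decodes entities with WebUtility.HtmlDecode, which covers HTML entities and is used here as a stand-in for real XML entity handling.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class SupportEmails
{
    // Group 1 captures the opening quote (" or '); \1 requires the same quote to close the value.
    private static readonly Regex Attr = new Regex(@"\bsupportemail\s*=\s*([""'])(.*?)\1");

    public static List<string> Extract(string input)
    {
        var emails = new List<string>();
        foreach (Match m in Attr.Matches(input))
            emails.Add(WebUtility.HtmlDecode(m.Groups[2].Value)); // e.g. &amp; becomes &
        return emails;
    }

    public static void Main()
    {
        string input = "<appid=\"928\" appname=\"extractapp\" supportemail=\"me@mydomain.com\" />" +
                       "<appid=\"929\" appname=\"otherapp\" supportemail='you@example.org' />";
        // Comma-separated output, as asked for in the question.
        Console.WriteLine(string.Join(",", Extract(input)));
    }
}
```

Because it looks only at the supportemail attribute, the order of the other attributes does not matter.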

Regular expression to define format of backup filenames

In the application I am currently working on, I have an option to create automatic backups of a certain file on the hard disk. What I would like to do is offer the user the possibility to configure the name of the file and its extension.
For example, the backup filename could be something like : "backup_month_year_username.bak". I had the idea to save the format in the form of a regular expression. For the example above, the regexp would look like :
"^backup_(?<Month>\d{2})_(?<Year>\d{2})_(?<Username>\w).(?<extension>bak)$"
I thought about using regex because I will also have to browse through the directory of backed-up files to delete those older than a certain date. The main trouble I have now is how to create a filename using the regex. In a way I should replace the tags with the information. I could do that using Regex.Replace and another regex, but I feel it's a bit weird doing that and there might be a better way.
Thanks
[Edit] Maybe I wasn't really clear in the first go, but the idea is of course that the user (in this case an admin that will know regex syntax) will have the possibility to modify the form of the filename, that's all the idea behind it[/Edit]

... and if the regex changes, it is next to impossible to reconstruct a string from a given regex.
Edit:
Create some predefined "place-holders": %u could be the user's name, %y could be the year, etc.:
backup_%m_%y_%u.bak
and then simply replace each %? with its actual value.
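A minimal sketch of that replacement, assuming %m and %y stand for a two-digit month and year (the class and method names are made up). The user name is substituted last so that its text is never scanned for further placeholders.

```csharp
using System;
using System.Globalization;

public static class BackupName
{
    // Expands %m (two-digit month), %y (two-digit year) and %u (user name).
    public static string Build(string format, DateTime date, string userName)
    {
        return format
            .Replace("%m", date.ToString("MM", CultureInfo.InvariantCulture))
            .Replace("%y", date.ToString("yy", CultureInfo.InvariantCulture))
            .Replace("%u", userName);
    }

    public static void Main()
    {
        Console.WriteLine(Build("backup_%m_%y_%u.bak", new DateTime(2024, 3, 5), "alice"));
        // backup_03_24_alice.bak
    }
}
```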

It sounds like you're trying to use the regular expression to create the file name from a pattern which the user should be able to specify.
Regular expressions can - AFAIK - not be used to create output, but only to validate input, so you'd have the user specify two things:
a file name production pattern like Bart suggested
a validation pattern in form of a regular expression that helps you split the file names into their parts
EDIT
By the way, your sample regex contains two errors: the unescaped "." matches any character, and \w only matches a single word character, so I guess you meant to write
"^backup_(?<Month>\d{2})_(?<Year>\d{2})_(?<Username>\w+)\.(?<extension>bak)$"

If the filename is always in this form, there is no reason for a regex, as it's easier to process with string.Split ...

With Bart's solution it is easy enough to split (using string.Split) the generated file name using underscore as the delimiter, to get back the information.
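One caveat when reading the parts back: splitting on underscores breaks as soon as a user name itself contains an underscore, whereas a regex with named groups still works because \w+ may include underscores. A hedged sketch, using the pattern with \w+ and an escaped dot; the class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BackupNameParser
{
    private static readonly Regex Pattern = new Regex(
        @"^backup_(?<Month>\d{2})_(?<Year>\d{2})_(?<Username>\w+)\.(?<extension>bak)$");

    // Returns false when the file name does not follow the backup naming scheme.
    public static bool TryParse(string fileName, out int month, out int year, out string user)
    {
        month = 0;
        year = 0;
        user = string.Empty;

        Match m = Pattern.Match(fileName);
        if (!m.Success) return false;

        month = int.Parse(m.Groups["Month"].Value);
        year = int.Parse(m.Groups["Year"].Value);
        user = m.Groups["Username"].Value;
        return true;
    }

    public static void Main()
    {
        if (TryParse("backup_03_24_john_doe.bak", out int month, out int year, out string user))
            Console.WriteLine($"{month} {year} {user}");
    }
}
```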

Ok, I think I have found a way to use only the regex. As I am using groups to get the information, I will use another regular expression to match the regular expression and replace the groups with the value:
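// Verbatim strings (@"...") keep the regex backslashes intact.
Regex rgx = new Regex(@"\(\?<Month>.+?\)");
string fileName = rgx.Replace(@"^backup_(?<Month>\d{2})_(?<Year>\d{2})_(?<Username>\w+)\.(?<extension>bak)$",
    DateTime.Now.Month.ToString("00")); // two digits, to agree with \d{2}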
Ok, it's really a hack, but at least it works and I have only one pattern defined by the user. It might not work if the regex is too complex, but I think I can deal with that problem.
What do you think?
