Converting wildcard pattern to regular expression

I am new to regular expressions. Recently I was given the task of converting a wildcard pattern to a regular expression. This will be used to check whether a file path matches the regex.
For example, if my pattern is *.jpg;*.png;*.bmp
I was able to generate the regex by splitting on semicolons, escaping each string, and replacing the escaped * with .*
String regex = "((?i)" + Regex.Escape(extension).Replace("\\*", ".*") + "$)";
So my resulting regex for jpg will be ((?i).*\.jpg$)
Then I combine all my extensions using the OR operator.
Thus my final expression for this example will be:
((?i).*\.jpg$)|((?i).*\.png$)|((?i).*\.bmp$)
I have tested it and it worked, but I am not sure whether I should add or remove anything to cover other cases, or whether there is a better format for the whole thing.
Also bear in mind that I can encounter a wildcard like *myfile.jpg, which should match all files whose names end with myfile.jpg
I can encounter patterns like *myfile.jpg;*.png;*.bmp

There's a lot of grouping going on there that isn't really needed. Unless there's something you haven't mentioned, this regex would do the same with less:
/.*\.(jpg|png|bmp)$/i
That's in regex notation; in C# that would be:
Regex regex = new Regex(@".*\.(jpg|png|bmp)$", RegexOptions.IgnoreCase);
If you have to programmatically translate between the two, you've started on the right track: split by semicolon, then group your extensions into the set (without the preceding dot). If your wildcard patterns can be more complicated (extensions with wildcards, multi-wildcard starting matches) it might need a bit more work ;)
Edit: (For your update)
If the wildcards can be more complicated, then you're almost there. The optimization in my code above that pulls the dot out (for the extension) has to be undone, so you'd end up with:
/.*(myfile\.jpg|\.png|\.bmp)$/i
Basically '*' -> '.*', '.' -> '\.' (gets escaped), and the rest goes into the set. It says: match anything ending (the dollar sign anchors to the end) in myfile.jpg, .png or .bmp.
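
The translation described above (split on semicolons, escape, turn the escaped * back into .*, join with |) could be sketched roughly as follows. WildcardsToRegex is a hypothetical helper name, and anchoring the alternation with ^ and $ is an assumption about how strict the match should be; treat it as a starting point rather than a complete wildcard implementation (it does not handle ?, for instance).

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: builds one case-insensitive regex from a
// semicolon-separated wildcard list such as "*myfile.jpg;*.png;*.bmp".
Regex WildcardsToRegex(string patterns)
{
    var alternatives = patterns
        .Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries)
        // Regex.Escape turns "*" into "\*" and "." into "\.";
        // the escaped star is then replaced by ".*".
        .Select(p => Regex.Escape(p.Trim()).Replace(@"\*", ".*"));

    // Assumption: the whole path must match one of the alternatives.
    return new Regex("^(?:" + string.Join("|", alternatives) + ")$",
                     RegexOptions.IgnoreCase);
}

var regex = WildcardsToRegex("*myfile.jpg;*.png;*.bmp");
Console.WriteLine(regex.IsMatch(@"C:\photos\holiday.PNG")); // True
Console.WriteLine(regex.IsMatch("notes.txt"));              // False
```

A single non-capturing group around the alternatives keeps the pattern shorter than wrapping every extension in its own group.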

Related

Regular expression that works on dots

I have this regular expression :
string[] values = Regex
    .Matches(mystring4, @"([\w-[\d]][\w\s-[\d]]+)|([0-9]+)")
    .OfType<Match>()
    .Select(match => match.Value.Trim())
    .ToArray();
This regular expression turns this string :
MY LIMITED COMPANY (52100000 / 58447000)
To these strings :
MY LIMITED COMPANY - 52100000 - 58447000
This also works on non-English characters.
But there is one problem, when I have this string : MY. LIMITED. COMPANY. , it splits that too. I don't want that. I don't want that regular expression to work on dots. How can I do that? Thanks.

You may add the dot after each \w in your pattern, and I also suggest removing unnecessary ( and ):
string[] values = Regex
    .Matches("MY. LIMITED. COMPANY. (52100000 / 58447000)", @"[\w.-[\d]][\w.\s-[\d]]+|[0-9]+")
    .OfType<Match>()
    .Select(match => match.Value.Trim())
    .ToArray();
foreach (var s in values)
    Console.WriteLine(s);
See the C# demo
Pattern:
[\w.-[\d]] - one Unicode letter or underscore ([\w-[\d]]) or a dot (.)
[\w.\s-[\d]]+ - 1 or more (due to + quantifier at the end) characters that are either Unicode letters or underscore, ., or whitespace (\s)
| - or
[0-9]+ - one or more ASCII-only digits

I'd simplify the expression. What if the names in the front include numbers? Note that my solution doesn't exactly mimic the original expression. It will allow numbers in the name part.
Let's start from the beginning:
To match words all you need is a sequence of word characters:
\w+
This will match any alphanumerical characters including underscores (_).
Considering you want the possibility of the word ending with a dot, you can add it and make it optional (one or zero matches):
\w+\.?
Note the backslash escape, which makes the dot a literal character rather than the "any character" metacharacter.
To match another potential word following, we now simply duplicate this match, add a white space before it, and make the whole group repeatable (zero or more times) using the * quantifier:
\w+\.?(?:\s\w+\.?)*
In case you haven't seen it, a group starting with ?: is a non-capturing group. In essence this works like a usual group, but won't save a captured group in your results.
And that's it already. This pattern will split your demo string as expected. Of course there could be other possible characters not being covered by this.
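
As a quick check of the simplified approach, here is a minimal sketch that runs the pattern, written with an explicit \s before each following word, against the demo string. The variable names are illustrative only.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string input = "MY. LIMITED. COMPANY. (52100000 / 58447000)";

// A word with an optional trailing dot, followed by any number of
// "whitespace + word + optional dot" repetitions.
string[] values = Regex.Matches(input, @"\w+\.?(?:\s\w+\.?)*")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();

Console.WriteLine(string.Join(" - ", values));
// MY. LIMITED. COMPANY. - 52100000 - 58447000
```

Unlike the original expression, this one does not exclude digits from the name part, as noted above.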
You can see the results of this matching online here and also play around with it.
To test your regular expressions (and to learn them), I'd really recommend you using a tool such as http://regex101.com
It has an input mask allowing you to provide your pattern and your target string. On the right hand side it will first explain the pattern to you (to see if it's indeed what you had in mind) and below it will show all the groups matched. Just keep in mind it actually uses slightly different flavors of regular expressions, but this shouldn't matter for such simple patterns. (I'm not affiliated with that site, just consider it really useful.)
As an alternative, to directly use C#'s regex parser, you can also try this Regex Tester. This works in a similar way, although doesn't include any explanations, which might be not as ideal for someone just getting started.

Should I use regular expressions in this situation?

I have a xml file containing certain expressions like this :-
1. AAaaaaa-1111
2. AAaaa-1111-aaa
3. AA11111-11111
4. AA111-111-111111
(AA static text) (aaaa-Any alphabet only) then hyphen (1111 - any digit only)
I was thinking I should write regular expressions for these; I believe regex should be the right approach.
But this XML file is dynamic. The user can remove or add different expressions in the list. So how can I use regular expressions here? Is there any kind of dynamic regular expression? Show me the light here please.
UPDATE: I am using these expressions to validate user input. So whatever the user enters in a box should match one of the expressions in the list.
For example:
If the user enters
AAabc-4567-trr
then it should be validated because it matches the 2nd expression in the list

Well,
What I assume from your question is that:
A is the letter A
a is any letter
1 is any number
That's the only way I see AAabc-4567-trr matches AAaaa-1111-aaa
Is that correct?
If it is correct, yes, you could use Regular Expressions. What you need to do is translate your patterns to regex patterns. Assuming you have a new pattern:
AAA-aaa-111
to obtain the regex that will recognize that pattern, all you have to do is translate its placeholders into regex patterns. For example:
string xmlPattern = "AAA-aaa-111";
string regexPattern = xmlPattern.Replace("a", "[a-zA-Z]").Replace("1", @"\d");
Edit:
You should take into account other characters that have special meanings in regular expressions, and translate/encode them properly. Maybe classify them. For example, these characters:
., $, ^
can be translated to regex patterns simply by putting a \ before them, so they become:
\., \$, \^, ...
If you can specify what is the format of the validation patterns you are storing in the XML files, I could help you a little more, but I'm just writing this answer kind of blind ;)
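
Putting the translation and the escaping advice together, a rough sketch might look like this. It assumes the placeholder meaning guessed above (a = any letter, 1 = any digit, everything else literal); TemplateToRegex and IsValid are hypothetical names, and the templates array stands in for whatever is loaded from the XML file.

```csharp
using System;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper: translates one template character by character.
string TemplateToRegex(string template)
{
    var sb = new StringBuilder("^");
    foreach (char c in template)
    {
        if (c == 'a') sb.Append("[a-zA-Z]");             // any letter
        else if (c == '1') sb.Append(@"\d");             // any digit
        else sb.Append(Regex.Escape(c.ToString()));      // literal, escaped if special
    }
    return sb.Append('$').ToString();
}

// Stand-in for the list read from the XML file.
string[] templates = { "AAaaaaa-1111", "AAaaa-1111-aaa", "AA11111-11111", "AA111-111-111111" };

// Input is valid if it matches at least one translated template.
bool IsValid(string input) =>
    templates.Any(t => Regex.IsMatch(input, TemplateToRegex(t)));

Console.WriteLine(IsValid("AAabc-4567-trr")); // True (matches the 2nd template)
Console.WriteLine(IsValid("AAabc-45x7-trr")); // False
```

Because the regexes are built at run time from the list, adding or removing templates needs no code change.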

Regular expressions that match certain sets of characters in a certain order are fairly simple. For example, this will match #2 (AAaaa-1111-aaa):
[A-Z]{2}[a-z]{3}-[0-9]{4}-[a-z]{3}
Breaking it down:
[A-Z]: Any character from A to Z. So any alphabetic, uppercase character.
{2}: Two of the previous item.
The rest of it works in the same way. The hyphens between things are there to match the hyphens in your expected input.

A beginner's guide to interpreting Regex?

Greetings.
I've been tasked with debugging part of an application that involves a Regex -- but, I have never dealt with Regex before. Two questions:
1) I know that the regexes are supposed to be testing whether or not two strings are equivalent, but what specifically do the two regex statements, below, mean in plain English?
2) Does anyone have a recommendation on websites / sources where I can learn more about Regexes? (preferably in C#)
if (Regex.IsMatch(testString, @"^(\s*?)(" + tag + @")(\s*?),", RegexOptions.IgnoreCase))
{
    result = true;
}
else if (Regex.IsMatch(testString, @",(\s*?)(" + tag + @")(\s*?),", RegexOptions.IgnoreCase))
{
    result = true;
}

It's going to be difficult to tell what that regex means, without knowing what's in tag. In fact, it looks like that regex is broken (or, at least, doesn't properly escape inputs).
Roughly speaking, for the first regex:
The ^ says to match at the beginning of the string.
The (...) sets up a capturing group (which is available, although this example apparently doesn't use it).
The \s matches any white space characters (spaces, tabs, etc.)
The *? matches zero or more of the previous character (in this case, whitespace), and because it has a question-mark, it matches the minimum number of characters needed to make the rest of the expression work.
The (" + tag + @") inserts the contents of the tag into the regex. As I mention, that's dangerous, without escaping.
The (\s*?) matches the same as the before (the minimum number of whitespace characters)
The , matches a trailing comma.
The second regex is very similar, but looks for a starting comma (rather than the beginning of the string).
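
To illustrate the escaping concern, here is a hedged sketch of how the two checks might be folded into one pattern with the tag passed through Regex.Escape. ContainsTag is a hypothetical name, and dropping the unused capturing groups and lazy quantifiers is a simplification that should not change whether a match is found.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical replacement for the two IsMatch calls.
bool ContainsTag(string testString, string tag)
{
    // (?:^|,) covers both cases: start of string, or a preceding comma.
    // Regex.Escape makes characters such as + or . in the tag literal.
    string pattern = @"(?:^|,)\s*" + Regex.Escape(tag) + @"\s*,";
    return Regex.IsMatch(testString, pattern, RegexOptions.IgnoreCase);
}

Console.WriteLine(ContainsTag("foo, bar, baz", "BAR")); // True
Console.WriteLine(ContainsTag("foo, bar, baz", "baz")); // False (no trailing comma)
Console.WriteLine(ContainsTag("aab, c", "a+b"));        // False once the tag is escaped
```

Note that, like the original, this never matches the last item of the list, because a trailing comma is required.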
I like the Python documentation for Regular Expressions, but it looks like this site has a pretty good, basic introduction, with C# examples.

One word - Cribsheet (or is that two?) :)

I'm not c# savvy but I can recommend an awesome guide to regular expressions that I use for Bash and Java programming. It applies to pretty much all languages:
http://www.amazon.com/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/0596528124/ref=tmm_pap_title_0
It is totally worth $30 to own this book. It is VERY thorough and helped my fundamental understanding of Regex a lot.
-Ryan

Since you specifically tagged C#, I recommend the Regex Hero as a tool you can use to play around with them since it's running on .NET. It also lets you toggle the different RegexOptions flags as you would pass them into the constructor when creating a new Regex.
Also, if you're using a version of Visual Studio 2010 that supports extensions, I would take a look at the Regex Editor extension... it will popup whenever you type new Regex( and offer you some guidance and autocomplete for your regex pattern.

Using The Regex Coach
The regular expression is a sequence consisting of the expression '(\s*?)', the expression '(tag)', the expression '(\s*?)', and the character ','.
where (\s*?) is defined as The regular expression is a repetition which matches a whitespace character as often as necessary.
the second one matches a , at the start too
As for good learning websites, I like www.regular-expressions.info/
Super simple version:
At the start of a string 0 or more spaces, whatever Tag is, 0 or More spaces, a comma.
the second one is
a comma, 0 or more spaces, whatever Tag is, 0 or More spaces, a comma.

Once you have the very basic idea about regex (it's full of resources over there) I recommend you to use Expresso for creating your regular expressions.
Expresso editor is equally suitable as a teaching tool for the beginning user of regular expressions or as a full-featured development environment for the experienced programmer or web designer with an extensive knowledge of regular expressions.

Your premise is not correct. Regular expressions are not used to tell if two strings are equivalent, but rather if the input string matches a certain pattern.
The first test above anchors at the start of the string (^), then looks for "zero or more whitespace characters", searching "non-greedy". It then matches the text of the variable "tag" in the middle, then "zero or more whitespace characters, non-greedy" again, followed by a comma.
The second one is very similar, except that the match must start with a comma instead of at the beginning of the string.
It is hard to explain "non-greedy" in this context, especially involving whitespace characters, so look here for more information.

A regular expression is a way to describe a set of strings that have some particular characteristics.
They aren't merely for comparing two strings. What you usually do is test whether a string matches a particular regular expression. They can also be used to do simple parsing of a string into tokens that follow some patterns.
The good thing about regexps is that they let you express certain constraints on a string while staying general enough to match a whole group of strings that respect those constraints. They also follow a formal specification that doesn't leave ambiguities around.
Here you can find a comparison table of various regular expression languages in many different programming languages and a specific guide for C# if you follow its link.
Usually the implementations for the various languages are quite similar since the syntax is somewhat standardized from the theoretical topics regexps come from, so any tutorial about regexp will be fine, then you'll just need to get into C# API.

1) The first regex is trying to do a case-insensitive match starting at the beginning of the test string. It then matches optional whitespace, followed by whatever is in tag, followed by optional whitespace then finally a comma.
The second matches a string containing a comma, followed by optional whitespace, followed by whatever is in tag, followed by optional whitespace then finally a comma.
Though this is about C#, I recommend picking up the Perl Pocket Reference, which has a great regex syntax reference. It helped me out a lot when I was learning regexes 14 years ago.

http://www.myregextester.com/ is a decent regular expression tester that also has an explain option for C# regexps - For Instance check out this example:
The regular expression:
(?-imsx:^(\s*?)(tagtext)(\s*?),)
matches as follows:
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  ^                      the beginning of the string
----------------------------------------------------------------------
  (                      group and capture to \1:
----------------------------------------------------------------------
    \s*?                 whitespace (\n, \r, \t, \f, and " ") (0
                         or more times (matching the least amount
                         possible))
----------------------------------------------------------------------
  )                      end of \1
----------------------------------------------------------------------
  (                      group and capture to \2:
----------------------------------------------------------------------
    tagtext              'tagtext'
----------------------------------------------------------------------
  )                      end of \2
----------------------------------------------------------------------
  (                      group and capture to \3:
----------------------------------------------------------------------
    \s*?                 whitespace (\n, \r, \t, \f, and " ") (0
                         or more times (matching the least amount
                         possible))
----------------------------------------------------------------------
  )                      end of \3
----------------------------------------------------------------------
  ,                      ','
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------

A regular expression does not tell you if two strings match, but rather if a given string matches a pattern.
This site is my favorite for learning and testing regular expressions:
http://gskinner.com/RegExr/
It allows you to interactively test regular expressions as you write them, and provides a built-in tutorial.

Although it doesn't use C#, Rejex is a simple tool for testing and learning about regular expressions which includes a quick reference for the special characters

It looks like that they are trying to match some kind of list of words delimited by colons (UPDATE: commas).
The first one is probably matching first item and the second one some item after the first one excluding the last one. I hope you will understand :).
A good source of information about regular expressions is at http://www.regular-expressions.info/

also a great site to test your regular expressions with extra info: http://regex101.com/

Regular Expression to reject special characters other than commas

I am working in asp.net. I am using Regular Expression Validator
Could you please help me in creating a regular expression for not allowing special characters other than comma. Comma has to be allowed.
I checked in regexlib, however I could not find a match. I tried with ^(a-z|A-Z|0-9)*[^#$%^&*()']*$ . When I add other characters as invalid, it does not work.
Also could you please suggest me a place where I can find a good resource of regular expressions? regexlib seems to be big; but any other place which lists very limited but most used examples?
Also, can I create expressions using C# code? Any articles for that?

[\w\s,]+
works fine, as you can see below.
RegExr is a great place to test your regular expressions with real time results, it also comes with a very complete list of common expressions.
[] character class
\w matches any word character (alphanumeric & underscore)
\s matches any whitespace character (spaces, tabs, line breaks)
, includes the comma
+ is a greedy match, which will match the previous item 1 or more times

[\d\w\s,]*
Just a guess

To answer on any articles, I got started here, find it to be an excellent resource:
http://www.regular-expressions.info/
For your current problem, try something like this:
[\w\s,]*
Here's a breakdown:
Match a single character present in the list below «[\w\s,]*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
A word character (letters, digits, etc.) «\w»
A whitespace character (spaces, tabs, line breaks, etc.) «\s»
The character “,” «,»

For a single character that is not a comma, [^,] should work perfectly fine.

You can try the regular expression [\w\s,]*. It matches only word characters (letters, digits, underscore), whitespace, and commas. If any other character appears in the text, a match against the whole input won't succeed.
For your second question regarding regular expression resources, you can go to
http://www.regular-expressions.info/
This website has a lot of tutorials on regex, plus a lot of useful information.
Also, can I create expressions using C# code? Any articles for that?
By this, do you mean to say you want to know which class and methods for regular expression execution? Or you want tool that will create regular expression for you?

You can create expressions with C#, something like this usually does the trick:
Regex regex = new Regex(@"^[a-z0-9,]*$", RegexOptions.IgnoreCase);
System.Console.Write("Enter Text");
String s = System.Console.ReadLine();
Match match = regex.Match(s);
if (match.Success)
{
    System.Console.WriteLine("True");
}
else
{
    System.Console.WriteLine("False");
}
System.Console.ReadLine();
You need to import the System.Text.RegularExpressions namespace.
The regular expression above accepts only numbers, letters (both upper and lower case) and the comma.
For a small introduction to Regular Expressions, I think that the book for MCTS 70-536 can be of a big help, I am pretty sure that you can either download it from somewhere or obtain a copy.
I am assuming that you never messed around with regular expressions in C#, hence I provided the code above.
Hope this helps.

Thank you, all..
[\w\s,]* works
Let me go through regular-expressions.info and come back if I need further support.
Let me try the C# code approach and come back if I need further support.
[This forum is awesome. Quality replies so quick..]
Thanks again

(…) is denoting a grouping and not a character set that’s denoted with […]. So try this:
^[a-zA-Z0-9,]*$
This will only allow alphanumeric characters and the comma.
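
The anchors matter when the pattern is used from plain C# code: without ^ and $, IsMatch succeeds as soon as any allowed substring (even an empty one) is found. A small sketch, with IsAllowed as a hypothetical helper name:

```csharp
using System;
using System.Text.RegularExpressions;

// Anchored: the whole input must consist of letters, digits and commas.
var allowed = new Regex(@"^[a-zA-Z0-9,]*$");
bool IsAllowed(string s) => allowed.IsMatch(s);

Console.WriteLine(IsAllowed("a,b,c1")); // True
Console.WriteLine(IsAllowed("a#b"));    // False

// Unanchored: matches the leading "a", so the check is useless for validation.
Console.WriteLine(Regex.IsMatch("a#b", @"[a-zA-Z0-9,]*")); // True
```

Because of the * quantifier the empty string is also accepted; switching to + would require at least one character.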

Regex search and replace where the replacement is a mod of the search term

I'm having a hard time finding a solution to this and am pretty sure that regex supports it. I just can't recall the name of the concept in the world of regex.
I need to search and replace a string for a specific pattern, but the patterns can be different and the replacement needs to "remember" what it's replacing.
For example, say i have an arbitrary string: 134kshflskj9809hkj
and i want to surround the numbers with parentheses,
so the result would be: (134)kshflskj(9809)hkj
Finding numbers is simple enough, but how to surround them?
Can anyone provide a sample or point me in the right direction?

In various languages:
// C#:
string result = Regex.Replace(input, @"(\d+)", "($1)");
// JavaScript:
thestring.replace(/(\d+)/g, '($1)');
// Perl:
s/(\d+)/($1)/g;
// PHP:
$result = preg_replace("/(\d+)/", '($1)', $input);
The parentheses around (\d+) make it a "group", specifically the first (and only, in this case) group, which can be backreferenced in the replacement string. The g flag is required in some implementations to make it match multiple times in a single string. The replacement string is fairly similar across languages, although some use \1 instead of $1 and some allow both.

Most regex replacement functions allow you to reference capture groups specified in the regex (a.k.a. backreferences), when defining your replacement string. For instance, using preg_replace() from PHP:
$var = "134kshflskj9809hkj";
$result = preg_replace('/(\d+)/', '(\1)', $var);
// $result now equals "(134)kshflskj(9809)hkj"
where \1 means "the first capture group in the regex".

Another somewhat generic solution is this:
search : /([\d]+)([^\d]*)/g
replace: ($1)$2
([\d]+): match a set of one or more digits and retain them in a group
([^\d]*): match a set of non-digits, and retain them as well. \D could work here, too.
g: indicate this is a global expression, to work multiple times on the input.
($1): in the replace block, parens have no special meaning, so output the first group, surrounding it with parens.
$2: output the second group
I used a pretty good online regex tool to test out my expression. The next step would be to apply it to the language that you are using, as each has its own implementation nuances.

Backreferences (grouping) are not necessary if you're just looking to search for numbers and replace with the found regex surrounded by parens. It is simpler to use the whole regex match in the replacement string.
e.g for perl
$text =~ s/\d+/($&)/g;
This searches for 1 or more digits and replaces with parens surrounding the match (specified by $&), with trailing g to find and replace all occurrences.
see http://www.regular-expressions.info/refreplace.html for the correct syntax for your regex language.
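
For C# specifically, both variants (a numbered group and the whole match) can be tried side by side. This is a minimal sketch using the sample string from the question; in .NET replacement strings, $1 refers to the first capture group and $& to the entire match.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "134kshflskj9809hkj";

// Variant 1: capture the digits in group 1 and reference it as $1.
string withGroup = Regex.Replace(input, @"(\d+)", "($1)");

// Variant 2: no group needed; $& stands for the whole match.
string wholeMatch = Regex.Replace(input, @"\d+", "($&)");

Console.WriteLine(withGroup);  // (134)kshflskj(9809)hkj
Console.WriteLine(wholeMatch); // (134)kshflskj(9809)hkj
```

Regex.Replace replaces every occurrence by default, so no equivalent of the g flag is needed.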

Depending on your language, you're looking to match groups.
So typically you'll make a pattern in the form of
([0-9]{1,})|([a-zA-Z]{1,})
Then you'll iterate over the resulting groups (how you do that is specific to your language).
