I'm looking for a regular expression to extract a string from a file name
eg if filename format is "anythingatallanylength_123_TESTNAME.docx", I'm interested in extracting "TESTNAME" ... probably fixed length of 8. (btw, 123 can be any three digit number)
I think I can use regex match ...
".*_[0-9][0-9][0-9]_[A-Z][A-Z][A-Z][A-Z][A-Z][A-Z][A-Z][A-Z].docx$"
However this matches the whole thing. How can I just get "TESTNAME"?
Thanks
Use parenthesis to match a specific piece of the whole regex.
You can also use the curly braces to specify counts of matching characters, and \d for [0-9].
In C#:
var myRegex = new Regex(#"*._\d{3}_([A-Za-z]{8})\.docx$");
Now "TESTNAME" or whatever your 8 letter piece is will be found in the captures collection of your regex after using it.
Also note, there will be a performance overhead for look-ahead and look-behind, as presented in some other solutions.
You can use a look-behind and a look-ahead to check parts without matching them:
(?<=_[0-9]{3}_)[A-Z]{8}(?=\.docx$)
Note that this is case-sensitive, you may want to use other character classes and/or quantifiers to fit your exact pattern.
In your file name format "anythingatallanylength_123_TESTNAME.docx", the pattern you are trying to match is a string before .docx and the underscore _. Keeping the thing in mind that any _ before doesn't get matched I came up with following solution.
Regex: (?<=_)[A-Za-z]*(?=\.docx$)
Flags used:
g global search
m multi-line search.
Explanation:
(?<=_) checks if there is an underscore before the file name.
(?=\.docx$) checks for extension at the end.
[A-Za-z]* checks the required match.
Regex101 Demo
Thanks to #Lucero #noob #JamesFaix I came up with ...
#"(?<=.*[0-9]{3})[A-Z]{8}(?=.docx$)"
So a look behind (in brackets, starting with ?<=) for anything (ie zero or more any char (denoted by "." ) followed by an underscore, followed by thee numerics, followed by underscore. Thats the end of the look behind. Now to match what I need (eight letters). Finally, the look ahead (in brackets, starting with ?=), which is the .docx
Nice work, fellas. Thunderbirds are go.
So basically I have this giant regular expression pattern, and somewhere in the middle of it is the expression (?:\s(\d\d\d)|(\d\d\d\d)). At this part of the parse I'm wanting to capture either 3 digits that follows a space or 4 digits, but I don't want the capture that comes from using the parenthesis around the whole thing (doesn't ?: make something non-capture). I have to use parenthesis so that the "or" logic works (I think).
So potential example inputs would be something like...
input1= giantexpression 123more characters after
input2= giantexpression1234blahblahblah
I tried (?:\s(\d\d\d)|(\d\d\d\d)) and it gave an extra capture at least in the case where I have 4 digits. So am I doing this right or am I messed up somewhere?
Edit:
To go into more detail... here's the current regular expression I'm working with.
pattern = #".?(\d{1,2})\s*(\w{2}).?.?.?(?:\s(\d\d\d)|(\d\d\d\d)).*"
There's a bit of parsing I have to do at the beginning. I think Sean Johnson's answer would still work because I wouldn't need to use "or". But is there a way to do it in which you DO use "or"? I think eventually I'll need that capability.
This should work:
(?:\s(\d{3,4}))
If you aren't doing any logic on that subpattern, you don't even need the parenthesis surrounding it if all you want to do is capture the digits. The following pattern:
\s(\d{3,4})
will capture three or four digits directly following a space character.
I have a xml file containing certain expressions like this :-
1. AAaaaaa-1111
2. AAaaa-1111-aaa
3. AA11111-11111
4. AA111-111-111111
(AA static text) (aaaa-Any alphabet only) then hyphen (1111 - any digit only)
I was thinking i should write regular expression for these I believe regex should be the right approach.
But this XML file is dynamic. User can remove or add different expressions in the list. So How can i use regular expression here? Is there any dynamic regular expression kind of thing. Show me the light here please.
UPDATE:- I am using these expressions to validate user input. So whatever user is entering in a box, it should be matched with any of these expressions from the list.
For Example:-
If user enters
AAabc-4567-trr
, then it should be validated coz it matches with 2nd expression in the list
Well,
What I assume from your question is that:
A is the letter A
a is any letter
1 is any number
That's the only way I see AAabc-4567-trr matches AAaaa-1111-aaa
Is that correct?
If it is correct, yes, you could use Regular Expressions. What you need to do is translate your patterns to regex patterns. Assuming you have a new pattern:
AAA-aaa-111
to obtain the regex that will recognize that pattern, all you have to do is translate that pattern into regex patterns. For example:
string xmlPattern = "AAA-aaa-111"
string regexPattern = xmlPattern.Replace("a", "[a-zA-Z]").Replace("1", #"\d");
Edit:
You should take in count other characters that have special meanings in Regular Expressions, and translate/encode them properly. Maybe classify them. For example, these characters:
., $, ^
can be easily translated to regex patterns just encoding them with a \ before, so they will become:
\., \$, \^, ...
If you can specify what is the format of the validation patterns you are storing in the XML files, I could help you a little more, but I'm just writing this answer kind of blind ;)
Regular expressions that match certain sets of characters in a certain order are fairly simple. For example, this will match #2 (AAaaa-1111-aaa):
[A-Z]{2}[a-z]{3}-[0-9]{4}-[a-z]{3}
Breaking it down:
[A-Z]: Any character from A to Z. So any alphabetic, uppercase character.
{2}: Two of the previous item.
The rest of it works in the same way. The hyphens between things are there to match the hyphens in your expected input.
can some tell how can i write regular expression matching abc.com.ae or abc.net.af or anything ,ae which is the last in the string is optional
this is successfull with / but not . don't know why
^[a-z]{1,25}.[a-z]{3}$
answer
^[a-z]{1,30}\.[a-z]{3}((\.)[a-z]{2})?$
The answer is: write \. instead of ., because . means “any character”.
You can match specific characters using the \u escape code followed immediately by the unicode number for that character in hex format and it must also be four digits only. I think a full stop is \u002E.
In your example where '/' works but not '.', this is because a '/' is recognised as a normal character and does not need to be escaped whereas the full stop has a different meaning in c# regex (it matches any character).
If youre still not sure, here is a useful guide to regex in C#: http://www.radsoftware.com.au/articles/regexlearnsyntax.aspx
And this one has examples of regex for URLs: http://www.radsoftware.com.au/articles/regexsyntaxadvanced.aspx
I sounds like you need some environment to learn regular expressions.
The best way to learn regular expressions is with an interactive regular expression simulator: type in an expression, and it will give you the results, instantly.
There are a few free (and not so free) regular expression simulators available:
Free web version from http://lumadis.be/regex/test_regex.php.
RegExBuddy from http://www.regexbuddy.com/. I am not affiliated with this program, but I have used it with good success in the past.
Greetings.
I've been tasked with debugging part of an application that involves a Regex -- but, I have never dealt with Regex before. Two questions:
1) I know that the regexes are supposed to be testing whether or not two strings are equivalent, but what specifically do the two regex statements, below, mean in plain English?
2) Does anyone have a recommendation on websites / sources where I can learn more about Regexes? (preferably in C#)
if (Regex.IsMatch(testString, #"^(\s*?)(" + tag + #")(\s*?),", RegexOptions.IgnoreCase))
{
result = true;
}
else if (Regex.IsMatch(testString, #",(\s*?)(" + tag + #")(\s*?),", RegexOptions.IgnoreCase))
{
result = true;
}
It's going to be difficult to tell what that regex means, without knowing what's in tag. In fact, it looks like that regex is broken (or, at least, doesn't properly escape inputs).
Roughly speaking, for the first regex:
The ^ says to match at the beginning of the string.
The (...) sets up a capturing group (which is available, although this example apparently doesn't use it).
The \s matches any white space characters (spaces, tabs, etc.)
The *? matches zero or more of the previous character (in this case, whitespace), and because it has a question-mark, it matches the minimum number of characters needed to make the rest of the expression work.
The (" + tag + #") inserts the contents of the tag into the regex. As I mention, that's dangerous, without escaping.
The (\s*?) matches the same as the before (the minimum number of whitespace characters)
The , matches a trailing comma.
The second regex is very similar, but looks for a starting comma (rather than the beginning of the string).
I like the Python documentation for Regular Expressions, but it looks like this site
has a pretty good, basic introduction, with C# examples.
One word - Cribsheet (or is that two?) :)
I'm not c# savvy but I can recommend an awesome guide to regular expressions that I use for Bash and Java programming. It applies to pretty much all languages:
http://www.amazon.com/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/0596528124/ref=tmm_pap_title_0
It is totally worth $30 to own this book. It is VERY thorough and helped my fundamental understanding of Regex a lot.
-Ryan
Since you specifically tagged C#, I recommend the Regex Hero as a tool you can use to play around with them since it's running on .NET. It also lets you toggle the different RegexOptions flags as you would pass them into the constructor when creating a new Regex.
Also, if you're using a version of Visual Studio 2010 that supports extensions, I would take a look at the Regex Editor extension... it will popup whenever you type new Regex( and offer you some guidance and autocomplete for your regex pattern.
Using The Regex Coach
The regular expression is a sequence consisting of the expression '(\s*?)', the expression '(tag)', the expression '(\s*?)', and the character ','.
where (\s*?) is defined as The regular expression is a repetition which matches a whitespace character as often as necessary.
the second one matches a , at the start too
As for good learning websites, I like www.regular-expressions.info/
Super simple version:
At the start of a string 0 or more spaces, whatever Tag is, 0 or More spaces, a comma.
the second one is
a comma, 0 or more spaces, whatever Tag is, 0 or More spaces, a comma.
Once you have the very basic idea about regex (it's full of resources over there) I recommend you to use Expresso for creating your regular expressions.
Expresso editor is equally suitable as a teaching tool for the beginning user of regular expressions or as a full-featured development environment for the experienced programmer or web designer with an extensive knowledge of regular expressions.
Your premise is not correct. Regular expressions are not used to tell if two strings are equivalent, but rather if the input string matches a certain pattern.
The first test above looks for any text that does not contain "zero or more whitespace charaters" searching "non-greedy". Then matches the text of the variable "tag" in the middle, then "zero or more whitespace characters, non greedy" again.
The second one is very similar, except that it allows for beginning whitespace as long as it follows a comma.
It is hard to explain "non-greedy" in this context, especially involving whitespace characters, so look here for more information.
A regular expression is a way to describe a set of strings that have some particular characteristics.
They don't merely need just to compare two strings.. what you usually do it to test if a string matches a particular regular expression. They can also be used to do simple parsing of a string in tokens that respect some patterns..
The good thing about regexps is that they allow you to express certain constraints inside a string keeping it general and able to match a group of strings that respect those constraints.. then they follow a formal specification that doesn't leave ambiguities around..
Here you can find a comparison table of various regular expression languages in many different programming languages and a specific guide for C# if you follow its link.
Usually the implementations for the various languages are quite similar since the syntax is somewhat standardized from the theoretical topics regexps come from, so any tutorial about regexp will be fine, then you'll just need to get into C# API.
1) The first regex is trying to do a case-insensitive match starting at the beginning of the test string. It then matches optional whitespace, followed by whatever is in tag, followed by optional whitespace then finally a comma.
The second matches a string containing a comma, followed by optional whitespace, followed by whatever is in tag, followed by optional whitespace then finally a comma.
Thought it's for C# I recommend picking up the Perl Pocket Reference which has a great Regex syntax reference. It helped my out a lot when I was learning regexes 14 years ago.
http://www.myregextester.com/ is a decent regular expression tester that also has an explain option for C# regexps - For Instance check out this example:
The regular expression:
(?-imsx:^(\s*?)(tagtext)(\s*?),)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
^ the beginning of the string
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
\s*? whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the least amount
possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
tagtext 'tagtext'
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
( group and capture to \3:
----------------------------------------------------------------------
\s*? whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the least amount
possible))
----------------------------------------------------------------------
) end of \3
----------------------------------------------------------------------
, ','
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
A regular expression does not tell you if two strings match, but rather if a given string matches a pattern.
This site is my favorite for learning and testing regular expressions:
http://gskinner.com/RegExr/
It allows you to interactively test regular expressions as you write them, and provides a built-in tutorial.
Although it doesn't use C#, Rejex is a simple tool for testing and learning about regular expressions which includes a quick reference for the special characters
It looks like that they are trying to match some kind of list of words delimited by colons (UPDATE: commas).
The first one is probably matching first item and the second one some item after the first one excluding the last one. I hope you will understand :).
A good source of information about regular expressions is at http://www.regular-expressions.info/
also a great site to test your regular expressions with extra info: http://regex101.com/