Convert regex pattern from php to C# [duplicate] - c#

I have a regex expression that I tested on http://gskinner.com/RegExr/ and it worked, but when I used it in my C# application it failed.
My regex expression: (?<!\d)\d{6}\K\d+(?=\d{4}(?!\d))
Text: 4000751111115425
Result: 111111
What is wrong with my regex expression?

This issue you are having is that .NET regular expressions do not support \K, "discard what has been matched so far".
I believe your regex translates as "match any string of more than ten \d digits, to as many digits as possible, and discard the first 6 and the last 4".
I believe that the .NET-compliant regex
(?<=\d{6})\d+(?=\d{4})
achieves the same thing. Note that the negative lookahead/behind for no-more-\ds is not necessary as the \d+ is greedy - the engine already will try to match as many digits as possible.

In general, \K operator (that discards all text matched so far from the match memory buffer) can be emulated with two techniques:
Lookarounds
Capturing groups (with and without lookarounds).
For example,
PCRE a+b+c+=\K\d+ (demo) = .NET (?<=a+b+c+=)\d+ or a+b+c+=(\d+) (and grab Group 1 value)
PCRE ^[^][]+\K.* (demo) = .NET (?<=^[^][]+)(?:\[.*)?$ (demo) or (better here) ^[^][]+(.*) (demo).
The problem with the second example is that [^][]+ can match the same text as .* (these patterns overlap) and since there is no clear boundary between the two patterns, just using a lookbehind is not actually working and needs additional tricks to make it work.
Capturing group approach is universal here and should work in all situations.
Since \K makes the regex engine "forget" the part of a match consumed so far, the best approach here is to use a capturing group to grab the part of a match you need to obtain after the left-hand context:
using System;
using System.Text.RegularExpressions;
public class Test
{
public static void Main()
{
var text = "Text 4000751111115425";
var result = Regex.Match(text, #"(?<!\d)\d{6}(\d+)(?=\d{4}(?!\d))")?.Groups[1].Value;
Console.WriteLine($"Result: '{result}'");
}
}
See the online C# demo and the regex demo (see Table tab for the proper result table). Details:
(?<!\d) - a left-hand digit boundary
\d{6} - six digits
(\d+) - Capturing group 1: one or more digits
(?=\d{4}(?!\d)) - a positive lookahead that matches a location that is immediately followed with four digits not immediately followed with another digit.

Related

RegEx expression works on regex101 but not in C# [duplicate]

https://regex101.com/r/sB9wW6/1
(?:(?<=\s)|^)#(\S+) <-- the problem in positive lookbehind
Working like this on prod: (?:\s|^)#(\S+), but I need a correct start index (without space).
Here is in JS:
var regex = new RegExp(/(?:(?<=\s)|^)#(\S+)/g);
Error parsing regular expression: Invalid regular expression:
/(?:(?<=\s)|^)#(\S+)/
What am I doing wrong?
UPDATE
Ok, no lookbehind in JS :(
But anyways, I need a regex to get the proper start and end index of my match. Without leading space.
Make sure you always select the right regex engine at regex101.com. See an issue that occurred due to using a JS-only compatible regex with [^] construct in Python.
JS regex - at the time of answering this question - did not support lookbehinds. Now, it becomes more and more adopted after its introduction in ECMAScript 2018. You do not really need it here since you can use capturing groups:
var re = /(?:\s|^)#(\S+)/g;
var str = 's #vln1\n#vln2\n';
var res = [];
while ((m = re.exec(str)) !== null) {
res.push(m[1]);
}
console.log(res);
The (?:\s|^)#(\S+) matches a whitespace or the start of string with (?:\s|^), then matches #, and then matches and captures into Group 1 one or more non-whitespace chars with (\S+).
To get the start/end indices, use
var re = /(\s|^)#\S+/g;
var str = 's #vln1\n#vln2\n';
var pos = [];
while ((m = re.exec(str)) !== null) {
pos.push([m.index+m[1].length, m.index+m[0].length]);
}
console.log(pos);
BONUS
My regex works at regex101.com, but not in...
First of all, have you checked the Code Generator link in the Tools pane on the left?
All languages - "Literal string" vs. "String literal" alert - Make sure you test against the same text used in code, literal string, at the regex tester. A common scenario is copy/pasting a string literal value directly into the test string field, with all string escape sequences like \n (line feed char), \r (carriage return), \t (tab char). See Regex_search c++, for example. Mind that they must be replaced with their literal counterparts. So, if you have in Python text = "Text\n\n abc", you must use Text, two line breaks, abc in the regex tester text field. Text.*?abc will never match it although you might think it "works". Yes, . does not always match line break chars, see How do I match any character across multiple lines in a regular expression?
All languages - Backslash alert - Make sure you correctly use a backslash in your string literal, in most languages, in regular string literals, use double backslash, i.e. \d used at regex101.com must written as \\d. In raw string literals, use a single backslash, same as at regex101. Escaping word boundary is very important, since, in many languages (C#, Python, Java, JavaScript, Ruby, etc.), "\b" is used to define a BACKSPACE char, i.e. it is a valid string escape sequence. PHP does not support \b string escape sequence, so "/\b/" = '/\b/' there.
All languages - Default flags - Global and Multiline - Note that by default m and g flags are enabled at regex101.com. So, if you use ^ and $, they will match at the start and end of lines correspondingly. If you need the same behavior in your code check how multiline mode is implemented and either use a specific flag, or - if supported - use an inline (?m) embedded (inline) modifier. The g flag enables multiple occurrence matching, it is often implemented using specific functions/methods. Check your language reference to find the appropriate one.
line-breaks - Line endings at regex101.com are LF only, you can't test strings with CRLF endings, see regex101.com VS myserver - different results. Solutions can be different for each regex library: either use \R (PCRE, Java, Ruby) or some kind of \v (Boost, PCRE), \r?\n, (?:\r\n?|\n)/(?>\r\n?|\n) (good for .NET) or [\r\n]+ in other libraries (see answers for C#, PHP). Another issue related to the fact that you test your regex against a multiline string (not a list of standalone strings/lines) is that your patterns may consume the end of line, \n, char with negated character classes, see an issue like that. \D matched the end of line char, and in order to avoid it, [^\d\n] could be used, or other alternatives.
php - You are dealing with Unicode strings, or want shorthand character classes to match Unicode characters, too (e.g. \w+ to match Стрибижев or Stribiżew, or \s+ to match hard spaces), then you need to use u modifier, see preg_match() returns 0 although regex testers work - To match all occurrences, use preg_match_all, not preg_match with /...pattern.../g, see PHP preg_match to find multiple occurrences and "Unknown modifier 'g' in..." when using preg_match in PHP?- Your regex with inline backreference like \1 refuses to work? Are you using a double quoted string literal? Use a single-quoted one, see Backreference does not work in PHP
phplaravel - Mind you need the regex delimiters around the pattern, see https://stackoverflow.com/questions/22430529
python - Note that re.search, re.match, re.fullmatch, re.findall and re.finditer accept the regex as the first argument, and the input string as the second argument. Not re.findall("test 200 300", r"\d+"), but re.findall(r"\d+", "test 200 300"). If you test at regex101.com, please check the "Code Generator" page. - You used re.match that only searches for a match at the start of the string, use re.search: Regex works fine on Pythex, but not in Python - If the regex contains capturing group(s), re.findall returns a list of captures/capture tuples. Either use non-capturing groups, or re.finditer, or remove redundant capturing groups, see re.findall behaves weird - If you used ^ in the pattern to denote start of a line, not start of the whole string, or used $ to denote the end of a line and not a string, pass re.M or re.MULTILINE flag to re method, see Using ^ to match beginning of line in Python regex
- If you try to match some text across multiple lines, and use re.DOTALL or re.S, or [\s\S]* / [\s\S]*?, and still nothing works, check if you read the file line by line, say, with for line in file:. You must pass the whole file contents as the input to the regex method, see Getting Everything Between Two Characters Across New Lines. - Having trouble adding flags to regex and trying something like pattern = r"/abc/gi"? See How to add modifers to regex in python?
c#, .net - .NET regex does not support possessive quantifiers like ++, *+, ??, {1,10}?, see .NET regex matching digits between optional text with possessive quantifer is not working - When you match against a multiline string and use RegexOptions.Multiline option (or inline (?m) modifier) with an $ anchor in the pattern to match entire lines, and get no match in code, you need to add \r? before $, see .Net regex matching $ with the end of the string and not of line, even with multiline enabled - To get multiple matches, use Regex.Matches, not Regex.Match, see RegEx Match multiple times in string - Similar case as above: splitting a string into paragraphs, by a double line break sequence - C# / Regex Pattern works in online testing, but not at runtime - You should remove regex delimiters, i.e. #"/\d+/" must actually look like #"\d+", see Simple and tested online regex containing regex delimiters does not work in C# code - If you unnecessarily used Regex.Escape to escape all characters in a regular expression (like Regex.Escape(#"\d+\.\d+")) you need to remove Regex.Escape, see Regular Expression working in regex tester, but not in c#
dartflutter - Use raw string literal, RegExp(r"\d"), or double backslashes (RegExp("\\d")) - https://stackoverflow.com/questions/59085824
javascript - Double escape backslashes in a RegExp("\\d"): Why do regex constructors need to be double escaped?
- (Negative) lookbehinds unsupported by most browsers: Regex works on browser but not in Node.js - Strings are immutable, assign the .replace result to a var - The .replace() method does change the string in place - Retrieve all matches with str.match(/pat/g) - Regex101 and Js regex search showing different results or, with RegExp#exec, RegEx to extract all matches from string using RegExp.exec- Replace all pattern matches in string: Why does javascript replace only first instance when using replace?
javascriptangular - Double the backslashes if you define a regex with a string literal, or just use a regex literal notation, see https://stackoverflow.com/questions/56097782
java - Word boundary not working? Make sure you use double backslashes, "\\b", see Regex \b word boundary not works - Getting invalid escape sequence exception? Same thing, double backslashes - Java doesn't work with regex \s, says: invalid escape sequence - No match found is bugging you? Run Matcher.find() / Matcher.matches() - Why does my regex work on RegexPlanet and regex101 but not in my code? - .matches() requires a full string match, use .find(): Java Regex pattern that matches in any online tester but doesn't in Eclipse - Access groups using matcher.group(x): Regex not working in Java while working otherwise - Inside a character class, both [ and ] must be escaped - Using square brackets inside character class in Java regex - You should not run matcher.matches() and matcher.find() consecutively, use only if (matcher.matches()) {...} to check if the pattern matches the whole string and then act accordingly, or use if (matcher.find()) to check if there is a single match or while (matcher.find()) to find multiple matches (or Matcher#results()). See Why does my regex work on RegexPlanet and regex101 but not in my code?
scala - Your regex attempts to match several lines, but you read the file line by line (e.g. use for (line <- fSource.getLines))? Read it into a single variable (see matching new line in Scala regex, when reading from file)
kotlin - You have Regex("/^\\d+$/")? Remove the outer slashes, they are regex delimiter chars that are not part of a pattern. See Find one or more word in string using Regex in Kotlin - You expect a partial string match, but .matchEntire requires a full string match? Use .find, see Regex doesn't match in Kotlin
mongodb - Do not enclose /.../ with single/double quotation marks, see mongodb regex doesn't work
c++ - regex_match requires a full string match, use regex_search to find a partial match - Regex not working as expected with C++ regex_match - regex_search finds the first match only. Use sregex_token_iterator or sregex_iterator to get all matches: see What does std::match_results::size return? - When you read a user-defined string using std::string input; std::cin >> input;, note that cin will only get to the first whitespace, to read the whole line properly, use std::getline(std::cin, input); - C++ Regex to match '+' quantifier - "\d" does not work, you need to use "\\d" or R"(\d)" (a raw string literal) - This regex doesn't work in c++ - Make sure the regex is tested against a literal text, not a string literal, see Regex_search c++
go - Double backslashes or use a raw string literal: Regular expression doesn't work in Go - Go regex does not support lookarounds, select the right option (Go) at regex101.com before testing! Regex expression negated set not working golang
groovy - Return all matches: Regex that works on regex101 does not work in Groovy
r - Double escape backslashes in the string literal: "'\w' is an unrecognized escape" in grep - Use perl=TRUE to PCRE engine ((g)sub/(g)regexpr): Why is this regex using lookbehinds invalid in R?
oracle - Greediness of all quantifiers is set by the first quantifier in the regex, see Regex101 vs Oracle Regex (then, you need to make all the quantifiers as greedy as the first one)] - \b does not work? Oracle regex does not support word boundaries at all, use workarounds as shown in Regex matching works on regex tester but not in oracle
firebase - Double escape backslashes, make sure ^ only appears at the start of the pattern and $ is located only at the end (if any), and note you cannot use more than 9 inline backreferences: Firebase Rules Regex Birthday
firebasegoogle-cloud-firestore - In Firestore security rules, the regular expression needs to be passed as a string, which also means it shouldn't be wrapped in / symbols, i.e. use allow create: if docId.matches("^\\d+$").... See https://stackoverflow.com/questions/63243300
google-data-studio - /pattern/g in REGEXP_REPLACE must contain no / regex delimiters and flags (like g) - see How to use Regex to replace square brackets from date field in Google Data Studio?
google-sheets - If you think REGEXEXTRACT does not return full matches, truncates the results, you should check if you have redundant capturing groups in your regex and remove them, or convert the capturing groups to non-capturing by add ?: after the opening (, see Extract url domain root in Google Sheet
sed - Why does my regular expression work in X but not in Y?
word-boundarypcrephp - [[:<:]] and [[:>:]] do not work in the regex tester, although they are valid constructs in PCRE, see https://stackoverflow.com/questions/48670105
snowflake-cloud-data-platform snowflake-sql - If you are writing a stored procedure, and \\d does not work, you need to double them again and use \\\\d, see REGEX conversion of VARCHAR value to DATE in Snowflake stored procedure using RLIKE not consistent.

Regular expression that works on dots

I have this regular expression :
string[] values = Regex
.Matches(mystring4, #"([\w-[\d]][\w\s-[\d]]+)|([0-9]+)")
.OfType<Match>()
.Select(match => match.Value.Trim())
.ToArray();
This regular expression turns this string :
MY LIMITED COMPANY (52100000 / 58447000)";
To these strings :
MY LIMITED COMPANY - 52100000 - 58447000
This also works on non-English characters.
But there is one problem, when I have this string : MY. LIMITED. COMPANY. , it splits that too. I don't want that. I don't want that regular expression to work on dots. How can I do that? Thanks.
You may add the dot after each \w in your pattern, and I also suggest removing unnecessary ( and ):
string[] values = Regex
.Matches("MY. LIMITED. COMPANY. (52100000 / 58447000)", #"[\w.-[\d]][\w.\s-[\d]]+|[0-9]+")
.OfType<Match>()
.Select(match => match.Value.Trim())
.ToArray();
foreach (var s in values)
Console.WriteLine(s);
See the C# demo
Pattern:
[\w.-[\d]] - one Unicode letter or underscore ([\w-[\d]]) or a dot (.)
[\w.\s-[\d]]+ - 1 or more (due to + quantifier at the end) characters that are either Unicode letters or underscore, ., or whitespace (\s)
| - or
[0-9]+ - one or more ASCII-only digits
I'd simplify the expression. What if the names in the front include numbers? Not that my solution doesn't exactly mimic the original expression. It will allow numbers in the name part.
Let's start from the beginning:
To match words all you need is a sequence of word characters:
\w+
This will match any alphanumerical characters including underscores (_).
Considering you want the possibility of the word ending with a dot, you can add it and make it optional (one or zero matches):
\w+\.?
Note the escape to make it an actual character rather than a character class "any character".
To match another potential word following, we now simply duplicate this match, add a white space before, and once again make it optional using the * quantifier:
\w+\.?(?:\w+\.?)*
In case you haven't seen a group starting with ?: is a non-matching group. In essence this works like a usual group, but won't save a matching group in your results.
And that's it already. This pattern will split your demo string as expected. Of course there could be other possible characters not being covered by this.
You can see the results of this matching online here and also play around with it.
To test your regular expressions (and to learn them), I'd really recommend you using a tool such as http://regex101.com
It has an input mask allowing you to provide your pattern and your target string. On the right hand side it will first explain the pattern to you (to see if it's indeed what you had in mind) and below it will show all the groups matched. Just keep in mind it actually uses slightly different flavors of regular expressions, but this shouldn't matter for such simple patterns. (I'm not affiliated with that site, just consider it really useful.)
As an alternative, to directly use C#'s regex parser, you can also try this Regex Tester. This works in a similar way, although doesn't include any explanations, which might be not as ideal for someone just getting started.

Extending regular expression syntax to say 'does not contain text XYZ'

I have an app where users can specify regular expressions in a number of places. These are used while running the app to check if text (e.g. URLs and HTML) matches the regexes. Often the users want to be able to say where the text matches ABC and does not match XYZ. To make it easy for them to do this I am thinking of extending regular expression syntax within my app with a way to say 'and does not contain pattern'. Any suggestions on a good way to do this?
My app is written in C# .NET 3.5.
My plan (before I got the awesome answers to this question...)
Currently I'm thinking of using the ¬ character: anything before the ¬ character is a normal regular expression, anything after the ¬ character is a regular expression that can not match in the text to be tested.
So I might use some regexes like this (contrived) example:
on (this|that|these) day(s)?¬(every|all) day(s) ?
Which for example would match 'on this day the man said...' but would not match 'on this day and every day after there will be ...'.
In my code that processes the regex I'll simply split out the two parts of the regex and process them separately, e.g.:
public bool IsMatchExtended(string textToTest, string extendedRegex)
{
int notPosition = extendedRegex.IndexOf('¬');
// Just a normal regex:
if (notPosition==-1)
return Regex.IsMatch(textToTest, extendedRegex);
// Use a positive (normal) regex and a negative one
string positiveRegex = extendedRegex.Substring(0, notPosition);
string negativeRegex = extendedRegex.Substring(notPosition + 1, extendedRegex.Length - notPosition - 1);
return Regex.IsMatch(textToTest, positiveRegex) && !Regex.IsMatch(textToTest, negativeRegex);
}
Any suggestions on a better way to implement such an extension? I'd need to be slightly cleverer about splitting the string on the ¬ character to allow for it to be escaped, so wouldn't just use the simple Substring() splitting above. Anything else to consider?
Alternative plan
In writing this question I also came across this answer which suggests using something like this:
^(?=(?:(?!negative pattern).)*$).*?positive pattern
So I could just advise people to use a pattern like, instead of my original plan, when they want to NOT match certain text.
Would that do the equivalent of my original plan? I think it's quite an expensive way to do it peformance-wise, and since I'm sometimes parsing large html documents this might be an issue, whereas I suppose my original plan would be more performant. Any thoughts (besides the obvious: 'try both and measure them!')?
Possibly pertinent for performance: sometimes there will be several 'words' or a more complex regex that can not be in the text, like (every|all) in my example above but with a few more variations.
Why!?
I know my original approach seems weird, e.g. why not just have two regexes!? But in my particular application administrators provide the regular expressions and it would be rather difficult to give them the ability to provide two regular expressions everywhere they can currently provide one. Much easier in this case to have a syntax for NOT - just trust me on that point.
I have an app that lets administrators define regular expressions at various configuration points. The regular expressions are just used to check if text or URLs match a certain pattern; replacements aren't made and capture groups aren't used. However, often they would like to specify a pattern that says 'where ABC is not in the text'. It's notoriously difficult to do NOT matching in regular expressions, so the usual way is to have two regular expressions: one to specify a pattern that must be matched and one to specify a pattern that must not be matched. If the first is matched and the second is not then the text does match. In my application it would be a lot of work to add the ability to have a second regular expression at each place users can provide one now, so I would like to extend regular expression syntax with a way to say 'and does not contain
pattern'.
You don't need to introduce a new symbol. There already is support for what you need in most regex engines. It's just a matter of learning it and applying it.
You have concerns about performance, but have you tested it? Have you measured and demonstrated those performance problems? It will probably be just fine.
Regex works for many many people, in many many different scenarios. It probably fits your requirements, too.
Also, the complicated regex you found on the other SO question, can be simplified. There are simple expressions for negative and positive lookaheads and lookbehinds.
?! ?<! ?= ?<=
Some examples
Suppose the sample text is <tr valign='top'><td>Albatross</td></tr>
Given the following regex's, these are the results you will see:
tr - match
td - match
^td - no match
^tr - no match
^<tr - match
^<tr>.*</tr> - no match
^<tr.*>.*</tr> - match
^<tr.*>.*</tr>(?<tr>) - match
^<tr.*>.*</tr>(?<!tr>) - no match
^<tr.*>.*</tr>(?<!Albatross) - match
^<tr.*>.*</tr>(?<!.*Albatross.*) - no match
^(?!.*Albatross.*)<tr.*>.*</tr> - no match
Explanations
The first two match because the regex can apply anywhere in the sample (or test) string. The second two do not match, because the ^ says "start at the beginning", and the test string does not begin with td or tr - it starts with a left angle bracket.
The fifth example matches because the test string starts with <tr.
The sixth does not, because it wants the sample string to begin with <tr>, with a closing angle bracket immediately following the tr, but in the actual test string, the opening tr includes the valign attribute, so what follows tr is a space. The 7th regex shows how to allow the space and the attribute with wildcards.
The 8th regex applies a positive lookbehind assertion to the end of the regex, using ?<. It says, match the entire regex only if what immediately precedes the cursor in the test string, matches what's in the parens, following the ?<. In this case, what follows that is tr>. After evaluating ``^.*, the cursor in the test string is positioned at the end of the test string. Therefore, thetr>` is matched against the end of the test string, which evaluates to TRUE. Therefore the positive lookbehind evaluates to true, therefore the overall regex matches.
The ninth example shows how to insert a negative lookbehind assertion, using ?<! . Basically it says "allow the regex to match if what's right behind the cursor at this point, does not match what follows ?<! in the parens, which in this case is tr>. The bit of regex preceding the assertion, ^<tr.*>.*</tr> matches up to and including the end of the string. Because the pattern tr> does match the end of the string. But this is a negative assertion, therefore it evaluates to FALSE, which means the 9th example is NOT a match.
The tenth example uses another negative lookbehind assertion. Basically it says "allow the regex to match if what's right behind the cursor at this point, does not match what's in the parens, in this case Albatross. The bit of regex preceding the assertion, ^<tr.*>.*</tr> matches up to and including the end of the string. Checking "Albatross" against the end of the string yields a negative match, because the test string ends in </tr>. Because the pattern inside the parens of the negative lookbehind does NOT match, that means the negative lookbehind evaluates to TRUE, which means the 10th example is a match.
The 11th example extends the negative lookbehind to include wildcards; in english the result of the negative lookbehind is "only match if the preceding string does not include the word Albatross". In this case the test string DOES include the word, the negative lookbehind evaluates to FALSE, and the 11th regex does not match.
The 12th example uses a negative lookahead assertion. Like lookbehinds, lookaheads are zero-width - they do not move the cursor within the test string for the purposes of string matching. The lookahead in this case, rejects the string right away, because .*Albatross.* matches; because it is a negative lookahead, it evaluates to FALSE, which mean the overall regex fails to match, which means evaluation of the regex against the test string stops there.
example 12 always evaluates to the same boolean value as example 11, but it behaves differently at runtime. In ex 12, the negative check is performed first, at stops immediately. In ex 11, the full regex is applied, and evaluates to TRUE, before the lookbehind assertion is checked. So you can see that there may be performance differences when comparing lookaheads and lookbehinds. Which one is right for you depends on what you are matching on, and the relative complexity of the "positive match" pattern and the "negative match" pattern.
For more on this stuff, read up at http://www.regular-expressions.info/
Or get a regex evaluator tool and try out some tests.
like this tool:
source and binary
You can easily accomplish your objectives using a single regex. Here is an example which demonstrates one way to do it. This regex matches a string containing "cat" AND "lion" AND "tiger", but does NOT contain "dog" OR "wolf" OR "hyena":
if (Regex.IsMatch(text, #"
# Match string containing all of one set of words but none of another.
^ # anchor to start of string.
# Positive look ahead assertions for required substrings.
(?=.*? cat ) # Assert string has: 'cat'.
(?=.*? lion ) # Assert string has: 'lion'.
(?=.*? tiger ) # Assert string has: 'tiger'.
# Negative look ahead assertions for not-allowed substrings.
(?!.*? dog ) # Assert string does not have: 'dog'.
(?!.*? wolf ) # Assert string does not have: 'wolf'.
(?!.*? hyena ) # Assert string does not have: 'hyena'.
",
RegexOptions.Singleline | RegexOptions.IgnoreCase |
RegexOptions.IgnorePatternWhitespace)) {
// Successful match
} else {
// Match attempt failed
}
You can see the needed pattern. When assembling the regex, be sure to run each of the user provided sub-strings through the Regex.escape() method to escape any metacharacters it may contain (i.e. (, ), | etc). Also, the above regex is written in free-spacing mode for readability. Your production regex should NOT use this mode, otherwise whitespace within the user substrings would be ignored.
You may want to add \b word boundaries before and after each "word" in each assertion if the substrings consist of only real words.
Note also that the negative assertion can be made a bit more efficient using the following alternative syntax:
(?!.*?(?:dog|wolf|hyena))

A beginner's guide to interpreting Regex?

Greetings.
I've been tasked with debugging part of an application that involves a Regex -- but, I have never dealt with Regex before. Two questions:
1) I know that the regexes are supposed to be testing whether or not two strings are equivalent, but what specifically do the two regex statements, below, mean in plain English?
2) Does anyone have a recommendation on websites / sources where I can learn more about Regexes? (preferably in C#)
if (Regex.IsMatch(testString, #"^(\s*?)(" + tag + #")(\s*?),", RegexOptions.IgnoreCase))
{
result = true;
}
else if (Regex.IsMatch(testString, #",(\s*?)(" + tag + #")(\s*?),", RegexOptions.IgnoreCase))
{
result = true;
}
It's going to be difficult to tell what that regex means, without knowing what's in tag. In fact, it looks like that regex is broken (or, at least, doesn't properly escape inputs).
Roughly speaking, for the first regex:
The ^ says to match at the beginning of the string.
The (...) sets up a capturing group (which is available, although this example apparently doesn't use it).
The \s matches any white space characters (spaces, tabs, etc.)
The *? matches zero or more of the previous character (in this case, whitespace), and because it has a question-mark, it matches the minimum number of characters needed to make the rest of the expression work.
The (" + tag + #") inserts the contents of the tag into the regex. As I mention, that's dangerous, without escaping.
The (\s*?) matches the same as the before (the minimum number of whitespace characters)
The , matches a trailing comma.
The second regex is very similar, but looks for a starting comma (rather than the beginning of the string).
I like the Python documentation for Regular Expressions, but it looks like this site
has a pretty good, basic introduction, with C# examples.
One word - Cribsheet (or is that two?) :)
I'm not c# savvy but I can recommend an awesome guide to regular expressions that I use for Bash and Java programming. It applies to pretty much all languages:
http://www.amazon.com/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/0596528124/ref=tmm_pap_title_0
It is totally worth $30 to own this book. It is VERY thorough and helped my fundamental understanding of Regex a lot.
-Ryan
Since you specifically tagged C#, I recommend the Regex Hero as a tool you can use to play around with them since it's running on .NET. It also lets you toggle the different RegexOptions flags as you would pass them into the constructor when creating a new Regex.
Also, if you're using a version of Visual Studio 2010 that supports extensions, I would take a look at the Regex Editor extension... it will popup whenever you type new Regex( and offer you some guidance and autocomplete for your regex pattern.
Using The Regex Coach
The regular expression is a sequence consisting of the expression '(\s*?)', the expression '(tag)', the expression '(\s*?)', and the character ','.
where (\s*?) is defined as The regular expression is a repetition which matches a whitespace character as often as necessary.
the second one matches a , at the start too
As for good learning websites, I like www.regular-expressions.info/
Super simple version:
At the start of a string 0 or more spaces, whatever Tag is, 0 or More spaces, a comma.
the second one is
a comma, 0 or more spaces, whatever Tag is, 0 or More spaces, a comma.
Once you have the very basic idea about regex (it's full of resources over there) I recommend you to use Expresso for creating your regular expressions.
Expresso editor is equally suitable as a teaching tool for the beginning user of regular expressions or as a full-featured development environment for the experienced programmer or web designer with an extensive knowledge of regular expressions.
Your premise is not correct. Regular expressions are not used to tell if two strings are equivalent, but rather if the input string matches a certain pattern.
The first test above looks for any text that does not contain "zero or more whitespace charaters" searching "non-greedy". Then matches the text of the variable "tag" in the middle, then "zero or more whitespace characters, non greedy" again.
The second one is very similar, except that it allows for beginning whitespace as long as it follows a comma.
It is hard to explain "non-greedy" in this context, especially involving whitespace characters, so look here for more information.
A regular expression is a way to describe a set of strings that have some particular characteristics.
They don't merely need just to compare two strings.. what you usually do it to test if a string matches a particular regular expression. They can also be used to do simple parsing of a string in tokens that respect some patterns..
The good thing about regexps is that they allow you to express certain constraints inside a string keeping it general and able to match a group of strings that respect those constraints.. then they follow a formal specification that doesn't leave ambiguities around..
Here you can find a comparison table of various regular expression languages in many different programming languages and a specific guide for C# if you follow its link.
Usually the implementations for the various languages are quite similar since the syntax is somewhat standardized from the theoretical topics regexps come from, so any tutorial about regexp will be fine, then you'll just need to get into C# API.
1) The first regex is trying to do a case-insensitive match starting at the beginning of the test string. It then matches optional whitespace, followed by whatever is in tag, followed by optional whitespace then finally a comma.
The second matches a string containing a comma, followed by optional whitespace, followed by whatever is in tag, followed by optional whitespace then finally a comma.
Thought it's for C# I recommend picking up the Perl Pocket Reference which has a great Regex syntax reference. It helped my out a lot when I was learning regexes 14 years ago.
http://www.myregextester.com/ is a decent regular expression tester that also has an explain option for C# regexps - For Instance check out this example:
The regular expression:
(?-imsx:^(\s*?)(tagtext)(\s*?),)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
^ the beginning of the string
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
\s*? whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the least amount
possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
tagtext 'tagtext'
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
( group and capture to \3:
----------------------------------------------------------------------
\s*? whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the least amount
possible))
----------------------------------------------------------------------
) end of \3
----------------------------------------------------------------------
, ','
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
A regular expression does not tell you if two strings match, but rather if a given string matches a pattern.
This site is my favorite for learning and testing regular expressions:
http://gskinner.com/RegExr/
It allows you to interactively test regular expressions as you write them, and provides a built-in tutorial.
Although it doesn't use C#, Rejex is a simple tool for testing and learning about regular expressions which includes a quick reference for the special characters
It looks like that they are trying to match some kind of list of words delimited by colons (UPDATE: commas).
The first one is probably matching first item and the second one some item after the first one excluding the last one. I hope you will understand :).
A good source of information about regular expressions is at http://www.regular-expressions.info/
also a great site to test your regular expressions with extra info: http://regex101.com/

Regular Expression to detect repetition within a string

Is it possible to detect repeated number patterns with a regular expression?
So for example, if I had the following string "034503450345", would it be possible to match the repeated sequence 0345? I have a feeling this is beyond the scope of regex, but I thought I would ask here anyway to see if I have missed something.
This expression will match one or more repeating groups:
(.+)(?=\1+)
Here is the same expression broken down, (using commenting so it can still be used directly as a regex).
(?x) # enable regex comment mode
( # start capturing group
.+ # one or more of any character (excludes newlines by default)
) # end capturing group
(?= # begin lookahead
\1+ # match one or more of the first capturing group
) # end lookahead
To match a specific pattern, change the .+ to that pattern, e.g. \d+ for one or more numbers, or \d{4,} to match 4 or more numbers.
To match a specific number of the pattern, change \1+, e.g to \1{4} for four repetitions.
To allow the repetition to not be next to each other, you can add .*? inside the lookahead.
Yes, you can - here's a Python test case
import re
print re.search(r"(\d+).*\1", "8034503450345").group(1)
# Prints 0345
The regular expression says "find some sequence of digits, then any amount of other stuff, then the same sequence again."
On a barely-related note, here's one of my favourite regular expressions - a prime number detector:
import re
for i in range(2, 100):
if not re.search(r"^(xx+)\1+$", "x"*i):
print i
Just to add a note to the (correct) answer from RichieHindle:
Note that while Python's regexp implementation (and many others, such as Perl's) can do this, this is no longer a regular expression in the narrow sense of the word.
Your example is not a regular language, hence cannot be handled by a pure regular expression. See e.g. the excellent Wikipedia article for details.
While this is mostly only of academic interest, there are some practical consequences. Real regular expressions can make much better guarantees for maximum runtimes than in this case. So you could get performance problems at some point.
Not to say that it's not a good solution, but you should realize that you're at the limit of what regular expressions (even in extended form) are capable of, and might want to consider other solutions in case of problems.
This is the C# code, that uses the backreference construct to find repeated digits. It will work with 034503450345, 123034503450345, 034503450345345, 232034503450345423. The regex is much easier and clearer to understand.
/// <summary>
/// Assigns repeated digits to repeatedDigits, if the digitSequence matches the pattern
/// </summary>
/// <returns>true if success, false otherwise</returns>
public static bool TryGetRepeatedDigits(string digitSequence, out string repeatedDigits)
{
repeatedDigits = null;
string pattern = #"^\d*(?<repeat>\d+)\k<repeat>+\d*$";
if (Regex.IsMatch(digitSequence, pattern))
{
Regex r = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);
repeatedDigits = r.Match(digitSequence).Result("${repeat}");
return true;
}
else
return false;
}
Use regex repetition:
bar{2,}
looks for text with two or more bar:
barbar
barbarbar
...

Categories