(original question removed)
I am looking for a regular expression that will turn a string consisting of special characters, letters, and numbers into a string containing only numbers.
There are special cases in which it is not enough to replace all non-numeric characters with "" (the empty string).
1.) Zero in brackets.
If the first bracket pair contains only zeros, e.g. (0), it should be removed. (A second bracket pair containing only zeros should not be removed.)
2.) Leading zero.
All leading zeros should be removed (ignoring brackets).
Examples for better understanding:
123 (0) 123 would be 123123 (zero removed)
(0) 123 -123 would be 123123 (zero and all other non-numeric characters removed)
2(0) 123 (0) would be 21230 (first zero in brackets removed)
20(0)123023(0) would be 201230230 (first zero in brackets removed)
00(0)1 would be 1 (leading zeros removed)
001(1)(0) would be 110 (leading zeros removed)
0(0)02(0) would be 20 (leading zeros removed)
123(1)3 would be 12313 (characters removed)
You could use a lookbehind to match (0) only if it's not at the beginning of the string, and replace with empty string as you're doing.
(original solution removed)
Updated again to reflect new requirements
Matches leading zeroes, matches (0) only if it's the first parenthesized item, and matches any non-digit characters:
^[0\D]+|(?<=^[^(]*)\(0\)|\D
Note that most regex engines do not support variable-length lookbehinds (i.e., the use of quantifiers like *), so this will only work in a few regex engines -- .NET's being one of them.
^[0\D]+ # zeroes and non-digits at start of string
| # or
(?<=^[^(]*) # preceded by start of string and only non-"(" chars
\(0\) # "(0)"
| # or
\D # non-digit, equivalent to "[^\d]"
(tested at regexhero.net)
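As a minimal sketch of how the single pattern above could be applied in C#, the snippet below replaces every match with the empty string. The class and method names (SinglePassCleaner, Clean) are hypothetical and only serve the illustration; the pattern itself is the one given above and relies on .NET's support for variable-length lookbehind.

```csharp
using System.Text.RegularExpressions;

public static class SinglePassCleaner
{
    // Leading zeroes/non-digits, OR "(0)" when no "(" precedes it, OR any non-digit.
    private static readonly Regex Junk =
        new Regex(@"^[0\D]+|(?<=^[^(]*)\(0\)|\D");

    // Hypothetical helper: removes everything the pattern matches.
    public static string Clean(string input)
    {
        return Junk.Replace(input, "");
    }
}
```

For instance, SinglePassCleaner.Clean("2(0) 123 (0)") should yield "21230", and SinglePassCleaner.Clean("001(1)(0)") should yield "110", in line with the examples in the question.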
You've changed and added requirements several times now. For multiple rules like this, you're probably better off coding for them individually. It could become complicated and difficult to debug if one condition matches and causes another condition not to match when it should. For example, in separate steps:
Remove parenthesized items as necessary.
Remove non-digit characters.
Remove leading zeroes.
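The three separate steps could be sketched roughly as follows. This is only an illustration under the assumption that "the first parenthesized item" means the first bracket pair in the string and that it is dropped only when it contains nothing but zeros; the names PhoneDigits and Normalize are made up for the example.

```csharp
using System.Text.RegularExpressions;

public static class PhoneDigits
{
    public static string Normalize(string input)
    {
        // Step 1: drop the first bracket pair, but only if it holds nothing but zeros.
        // [^(]* guarantees that no other "(" comes before it.
        string withoutFirstZeroGroup = Regex.Replace(input, @"^([^(]*)\(0+\)", "$1");

        // Step 2: remove every character that is not an ASCII digit.
        string digitsOnly = Regex.Replace(withoutFirstZeroGroup, "[^0-9]", "");

        // Step 3: strip leading zeroes.
        return Regex.Replace(digitsOnly, "^0+", "");
    }
}
```

Each step can be tested and changed on its own, which is the main argument for this layout. For example, PhoneDigits.Normalize("0(0)02(0)") goes through "002(0)" and "0020" before ending up as "20".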
But if you absolutely need these three conditions all matched in a single regular expression (not recommended), here it is.
Regexes get much, much simpler if you can use multiple passes. I think you could do a first pass to drop your (0) if it's not the first thing in a string, then follow it with stripping out the non-digits:
var noMidStrParenZero = Regex.Replace(text, @"^([^(]+)\(0\)", "$1");
var finalStr = Regex.Replace(noMidStrParenZero, "[^0-9]", "");
Avoids a lot of regex craziness, and it's also self-documenting to an extent.
EDIT: this version should work with your new examples too.
This regex should be pretty near the one you're searching for.
(^[^\d])|([^\d](0[^\d])?)+
(You can replace everything that is caught by an empty string)
EDIT :
Your request has evolved and is now too complex to be handled in a single pass. Assuming there is always a space before a bracket group, you can use these passes (keep this order):
string[] entries = new string[7] {
"800 (0) 123 - 1",
"800 (1) 123",
"(0)321 123",
"1 (0) 1",
"1 (12) (0) 1",
"1 (0) (0) 1",
"(9)156 (1) (0)"
};
foreach (string entry in entries)
{
var output = Regex.Replace(entry, @"\(0\)\s*\(0\)", "0");
output = Regex.Replace(output, @"\s\(0\)", "");
output = Regex.Replace(output, @"[^\d]", "");
System.Console.WriteLine("---");
System.Console.WriteLine(entry);
System.Console.WriteLine(output);
}
(?: # start grouping
^ # start of string
| # OR
^\( # start of string followed by paren
| # OR
\d # a digit
) # end grouping
(0+) # capture any number of zeros
| # OR
([1-9]) # capture any non-zero digit
This works for all of your example strings, but the entire expression does match the ( followed by the zero. You can use Regex.Matches to get the match collection using a global match and then join all of the matched groups into a string to get numbers only (or just remove any non-numbers).
Related
I have Text as below
1. This is 678 897 999
not a text which I want
2. This is 678 897 879
I have applied regex as
This\s*is\s*(\s+\d+){1,}(?: ){0,}[\r\n]+
Now I want to match only a string that does not have the word "not" right after the matched text. I don't want the regex to match the first string.
EDIT
Suppose I have 2 string as above and I applied regex then I found 2 match
This is 678 897 999
This is 678 897 879
Up to this point all is fine, but now I want a regex that rejects the string followed by "not" (the first one); I want to match only the second string.
This\s*is\s*(\s+\d+){1,}(?: ){0,}(?:[\r\n]+|$)(?!not)
Just add a lookahead. See the demo:
https://regex101.com/r/eB8xU8/8
I want regex which does not contain not(in first string), I want to match only 2nd string.
That means you should check if the This is... pattern is not followed by newline sequence + spaces* + not as a whole word with backtracking disabled. We can disable backtracking using atomic group in .NET:
(?>This\s+is(?:\s+\d+)+ *)(?![\r\n]+\p{Zs}*not\b)
See the regex demo
Part 1 of the regex, This\s+is(?:\s+\d+)+ *, matches This is followed by one or more sequences of one or more whitespace characters and one or more digits, then by zero or more spaces. The (?>...) prevents backtracking into this part of the pattern. The lookahead (?![\r\n]+\p{Zs}*not\b) fails the match if the previously matched text is followed by line breaks, optional spaces, and the whole word not (where \b stands for a word boundary).
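To see why the atomic group matters, here is a small sketch that runs the pattern with and without it. The class name NotFollowedDemo, its members, and the sample text used in the note below are assumptions made for the illustration.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class NotFollowedDemo
{
    // Pattern from the answer: backtracking into the first part is disabled.
    public const string Atomic =
        @"(?>This\s+is(?:\s+\d+)+ *)(?![\r\n]+\p{Zs}*not\b)";

    // Same pattern without the atomic group, for comparison only.
    public const string NonAtomic =
        @"This\s+is(?:\s+\d+)+ *(?![\r\n]+\p{Zs}*not\b)";

    // Hypothetical helper: returns the text of every match.
    public static string[] Find(string pattern, string text)
    {
        return Regex.Matches(text, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }
}
```

With the sample text "1. This is 678 897 999\nnot a text which I want\n2. This is 678 897 879", the atomic version should return only "This is 678 897 879". The non-atomic version also returns "This is 678 897 99" for the first line, because the engine gives back the last digit so that the lookahead no longer sees the line break.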
This question may sound stupid. But I have tried several options and none of them worked.
I have the following in a string variable.
string myText="*someText*someAnotherText*";
What I mean by above is that, there can be 0 or more characters before "someText". There can be 0 or more characters after "someText" and before "someAnotherText". Finally, there can be 0 or more occurrences of any character before "someAnotherText".
I tried the following.
string res = Regex.Replace(searchFor.ToLower(), "*", @"\S*");
It didn't work.
Then I tried the following.
string res = Regex.Replace(searchFor.ToLower(), "*", @"\*");
Even that didn't work.
Can someone help, please?
Even though I have mentioned "*" to indicate 0 or more occurrences, it says that I haven't mentioned the number of occurrences.
Unlike the DOS wildcard character, the * character in a regular expression means repeat the previous item (character, group, whatever) 0 or more times. In your regular expression the first * has no preceding character, the second one follows the t character, so will repeat that any number of times.
To get '0 or more of any character' you need to use the composition .* where . is 'any character' and * is '0 or more times'.
In other words to search for someText followed any number of characters later by someAnotherText you would use the following Regex:
var re = new Regex(@"someText.*someAnotherText");
Note that unless you specify otherwise by putting start/end specifiers in (^ for start of string, $ for end) the Regex will match any substring of the test string.
Tests for the above, all returning true:
re.IsMatch("This is someText, followed by someAnotherText with text after.");
re.IsMatch("someTextsomeAnotherText");
re.IsMatch("start:someTextsomeAnotherText:end");
And so on.
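If the goal is to keep the DOS-style * wildcard in the search string, one possible approach (a sketch, not something stated in the question) is to escape the string with Regex.Escape and then turn each escaped asterisk into .*. The names Wildcard and ToRegex are hypothetical, and the case-insensitive option is an assumption based on the ToLower() call in the question.

```csharp
using System.Text.RegularExpressions;

public static class Wildcard
{
    // Converts a pattern such as "*someText*someAnotherText*" into a Regex.
    public static Regex ToRegex(string wildcard)
    {
        // Regex.Escape turns "*" into "\*" (and neutralizes other metacharacters);
        // each escaped asterisk is then replaced by ".*".
        string body = Regex.Escape(wildcard).Replace(@"\*", ".*");

        // Anchors make the wildcard pattern describe the whole string.
        return new Regex("^" + body + "$", RegexOptions.IgnoreCase);
    }
}
```

Wildcard.ToRegex("*someText*someAnotherText*").IsMatch("someTextsomeAnotherText") should then return true, while a string in which someAnotherText appears only before someText should not match.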
In Regex terms * is a quantifier. Other quantifiers are:
? Match 0 or 1
+ Match 1 or more
{n} Match 'n' times
{n,} Match at least 'n' times
{n,m} Match 'n' to 'm' times
All apply to the preceding term in the Regex.
Placing a ? after another quantifier (including ?) will convert it to lazy form, where it will match as few items as it can. This will allow following terms to also match the terms you specified.
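The difference between greedy and lazy quantifiers is easiest to see side by side. The following is a small sketch; QuantifierDemo and FirstMatch are hypothetical names used only for this illustration.

```csharp
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Returns the text of the first match, or "" if there is none.
    public static string FirstMatch(string pattern, string input)
    {
        return Regex.Match(input, pattern).Value;
    }
}
```

FirstMatch("<.*>", "<a><b>") returns "<a><b>" because the greedy .* takes as much as it can, whereas FirstMatch("<.*?>", "<a><b>") returns "<a>". Likewise, "a{2,3}" applied to "aaaa" matches "aaa", while "a{2,3}?" matches only "aa".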
The regular expression to match 0 or more occurrences of any character is
.*
where . matches any single character and * matches zero or more occurrences of it.
(This answer is a quick reference simplification of the current answer.)
I have one strange issue on my .NET project with RegEx. Please, see C# code below:
const string PATTERN = @"^[a-zA-Z]([-\s\.a-zA-Z]*('(?!'))?[-\s\.a-zA-Z]*)*$";
const string VALUE = "Ingebrigtsen Myre (Øvre)";
System.Text.RegularExpressions.Regex regex = new System.Text.RegularExpressions.Regex(PATTERN);
if (!regex.IsMatch(VALUE)) // <--- Infinite loop here
return string.Empty;
// Some other code
I use this pattern to validate all types of names (first names, last names, middle names, etc.). The value is a parameter, but I provided it as a constant above because the issue is not reproduced often, only with special symbols: *, (, ), etc. (sorry, but I don't have the full list of these symbols).
Can you help me to fix this infinite loop? Thanks for any help.
Added: this code is placed on the very base level of project and I don't want to do any refactoring there - I just want to have quick fix for this issue.
Added 2: I do know that it technically is not a loop - I meant that "regex.IsMatch(VALUE)" never ends. I waited for about an hour and it was still executing.
Your non-trivial regex: ^[a-zA-Z]([-\s\.a-zA-Z]*('(?!'))?[-\s\.a-zA-Z]*)*$, is better written with comments in free-spacing mode like so:
Regex re_orig = new Regex(@"
^ # Anchor to start of string.
[a-zA-Z] # First char must be letter.
( # $1: Zero or more additional parts.
[-\s\.a-zA-Z]* # Zero or more valid name chars.
( # $2: optional quote.
' # Allow quote but only
(?!') # if not followed by quote.
)? # End $2: optional quote.
[-\s\.a-zA-Z]* # Zero or more valid name chars.
)* # End $1: Zero or more additional parts.
$ # Anchor to end of string.
",RegexOptions.IgnorePatternWhitespace);
In English, this regex essentially says: "Match a string that begins with an alpha letter [a-zA-Z] followed by zero or more alpha letters, whitespaces, periods, hyphens or single quotes, but each single quote may not be immediately followed by another single quote."
Note that your above regex allows oddball names such as: "ABC---...'... -.-.XYZ " which may or may not be what you need. It also allows multi-line input and strings that end with whitespace.
The "infinite loop" problem with the above regex is that catastrophic backtracking occurs when it is applied to a long input that cannot match, for example one containing a disallowed character or two single quotes in a row. Here is an equivalent pattern which matches (and fails to match) the exact same strings, but does not experience catastrophic backtracking:
Regex re_fixed = new Regex(@"
^ # Anchor to start of string.
[a-zA-Z] # First char must be letter.
[-\s.a-zA-Z]* # Zero or more valid name chars.
(?: # Zero or more isolated single quotes.
' # Allow single quote but only
(?!') # if not followed by single quote.
[-\s.a-zA-Z]* # Zero or more valid name chars.
)* # Zero or more isolated single quotes.
$ # Anchor to end of string.
",RegexOptions.IgnorePatternWhitespace);
And here it is in short form in your code context:
const string PATTERN = @"^[a-zA-Z][-\s.a-zA-Z]*(?:'(?!')[-\s.a-zA-Z]*)*$";
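As a rough sketch of how the rewritten pattern might be dropped into the validation code, the helper below also passes a match timeout to the Regex constructor as a safety net. The timeout value, the class name NameValidator, and the choice to treat a timeout as "invalid" are assumptions, not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NameValidator
{
    // Rewritten pattern: each position can be consumed in only one way.
    private static readonly Regex NamePattern = new Regex(
        @"^[a-zA-Z][-\s.a-zA-Z]*(?:'(?!')[-\s.a-zA-Z]*)*$",
        RegexOptions.None,
        TimeSpan.FromSeconds(1)); // assumed limit; guards against runaway matching

    public static bool IsValidName(string value)
    {
        try
        {
            return NamePattern.IsMatch(value);
        }
        catch (RegexMatchTimeoutException)
        {
            // Assumption: input that cannot be checked in time is rejected.
            return false;
        }
    }
}
```

NameValidator.IsValidName("Ingebrigtsen Myre (Øvre)") should now return false promptly, while "O'Brien" is accepted and "O''Brien" is rejected.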
Look at this part of your regex:
(  [-\s\.a-zA-Z]*  ('(?!'))?  [-\s\.a-zA-Z]*  )*$
^  ^                ^          ^                 ^
|  |                |          |                 |
|  |                |          |                 This group repeats any number of times
|  |                |          charclass repeats any number of times
|  |                This group is optional
|  This character class also repeats any number of times
Outer group (repeated, as seen above)
That means that as soon as your input string contains a character that's not in the character class (like the brackets and non-ASCII letter in your example), the preceding characters will be tried in a lot of permutations whose number increases exponentially with the length of the string.
To avoid that (and to allow a faster failure of the regex), use atomic groups:
const string PATTERN = @"^[a-zA-Z](?>(?>[-\s\.a-zA-Z]*)(?>'(?!'))?(?>[-\s\.a-zA-Z])*)*$";
You've got an "any number of any number" here:
...[-\s\.a-zA-Z]*)*
and because your input doesn't match, the engine backtracks to try all permutations of dividing the input up, and the number of attempts grows exponentially with the length of the input.
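You can fix it by adding a "+" to make a possessive quantifier, which once it has consumed its characters will not backtrack to find other combinations:
^[a-zA-Z]([-\s\.a-zA-Z]*('(?!'))?[-\s\.a-zA-Z]*+)*$
(the added + follows the last * inside the group)
You can see a live demo (on rubular) demonstrating that adding the plus fixed the loop problem, and still matches input that doesn't have the odd characters.
Note that .NET's regex engine does not support possessive quantifiers and rejects *+ as a nested quantifier. In C#, the equivalent is an atomic group around the same term:
const string PATTERN = @"^[a-zA-Z]([-\s\.a-zA-Z]*('(?!'))?(?>[-\s\.a-zA-Z]*))*$";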
I need to match a string under the following conditions using Regex in C#:
Entire string can only be alphanumeric (including spaces).
Must be at most 15 characters long (including spaces).
First & last characters can only be a letter.
A single space can appear multiple times in anywhere but the first and last characters of the string. (Multiple spaces together should not be allowed).
Capitalization should be ignored.
Should match the WHOLE word(s).
If any one of these preconditions are broken, a match should not follow.
Here is what I currently have:
^\b([A-z]{1})(([A-z0-9 ])*([A-z]{1}))?\b$
And here are some test strings that should match:
Stack OverFlow
Iamthe greatest
A
superman23s
One Two Three
And some that shouldn't match (note the spaces):
Stack [double_space] Overflow Rocks
23Hello
ThisIsOver15CharactersLong
Hello23
[space_here]hey
etc.
Any suggestions would be much appreciated.
You should use lookaheads:
^(?=[a-zA-Z]([a-zA-Z\d\s]*[a-zA-Z])?$)(?=.{1,15}$)(?!.*\s{2,}).*$
The trailing .*$ matches only if all three lookaheads succeed:
(?=[a-zA-Z]([a-zA-Z\d\s]*[a-zA-Z])?$) checks that the string starts with a letter, optionally followed by any number of allowed characters and a final letter, so it cannot end with a space or a digit
(?=.{1,15}$) checks that the string is between 1 and 15 characters long
(?!.*\s{2,}) checks that there are no two or more consecutive whitespace characters
you can try it here
Try this regex: -
"^([a-zA-Z]([ ](?=[a-zA-Z0-9])|[a-zA-Z0-9]){0,13}[a-zA-Z])$"
Explanation : -
[a-zA-Z] // Match first character letter
( // Capture group
[ ](?=[a-zA-Z0-9]) // Match either a `space followed by non-whitespace` (Avoid double space, but accept single whitespace)
| // or
[a-zA-Z0-9] // just `non-whitespace` characters
){0,13} // from `0 to 13` character in length
[a-zA-Z] // Match last character letter
Update : -
To handle single characters, you can make the part of the pattern after the first character optional, as nicely pointed out by @Rawling in the comments:
"^([a-zA-Z](([ ](?=[a-zA-Z0-9])|[a-zA-Z0-9]){0,13}[a-zA-Z])?)$"
Here everything after the first [a-zA-Z] is wrapped in an extra capture group, ( ... )?, which makes it optional.
And my version, again using look-aheads:
^(?=.{1,15}$)(?=^[A-Z].*)(?=.*[A-Z]$)(?!.*[ ]{2})[A-Z0-9 ]+$
explained:
^ start of string
(?=.{1,15}$) positive look-ahead: must be between 1 and 15 chars
(?=^[A-Z].*) positive look-ahead: initial char must be alpha
(?=.*[A-Z]$) positive look-ahead: last char must be alpha
(?!.*[ ]{2}) negative look-ahead: string mustn't contain 2 or more consecutive spaces anywhere (the .* is needed so that the check is not limited to the start of the string)
[A-Z0-9 ]+ if all the look-aheads agree, select only alpha-numeric chars + space
$ end of string
This will also need the IgnoreCase option setting
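Put together in C#, the look-ahead approach might look like the sketch below. The class name ShortNameValidator is hypothetical, the double-space check is applied to the whole string via .*, and RegexOptions.CultureInvariant is an added assumption so that case-insensitive matching does not depend on the current culture.

```csharp
using System.Text.RegularExpressions;

public static class ShortNameValidator
{
    private static readonly Regex Pattern = new Regex(
        @"^(?=.{1,15}$)(?=[A-Z])(?=.*[A-Z]$)(?!.*[ ]{2})[A-Z0-9 ]+$",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // True when the input is 1 to 15 chars, alphanumeric plus single spaces,
    // and both starts and ends with a letter.
    public static bool IsValid(string input)
    {
        return Pattern.IsMatch(input);
    }
}
```

With this, "Stack OverFlow", "A" and "superman23s" from the question are accepted, while "23Hello", "Hello23", " hey" and anything with two spaces in a row are rejected.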
Guys I hate Regex and I suck at writing.
I have a space-separated string that contains several codes that I need to pull out. Each code begins with a capital letter and ends with a number. A code is only two characters long.
I'm trying to create an array of strings from the initial string and I can't get the regular expression right.
Here is what I have
String[] test = Regex.Split(originalText, "([a-zA-Z0-9]{2})");
I also tried:
String[] test = Regex.Split(originalText, "([A-Z]{1}[0-9]{1})");
I don't have any experience with Regex as I try to avoid writing them whenever possible.
Anyone have any suggestions?
Example input:
AA2410 F7 A4 Y7 B7 A 0715 0836 E0.M80
I need to pull out F7, A4, B7. E0 should be ignored.
You want to collect the results, not split on them, right?
Regex regexObj = new Regex(@"\b[A-Z][0-9]\b");
MatchCollection allMatchResults = regexObj.Matches(subjectString);
should do this. The \bs are word boundaries, making sure that only entire strings (like A1) are extracted, not substrings (like the A1 in TWA101).
If you also need to exclude "words" with non-word characters in them (like E0.M80 in your comment), you need to define your own word boundary, for example:
Regex regexObj = new Regex(@"(?<=^|\s)[A-Z][0-9](?=\s|$)");
Now A1 only matches when surrounded by whitespace (or start/end-of-string positions).
Explanation:
(?<= # Assert that we can match the following before the current position:
^ # Start of string
| # or
\s # whitespace.
)
[A-Z] # Match an uppercase ASCII letter
[0-9] # Match an ASCII digit
(?= # Assert that we can match the following after the current position:
\s # Whitespace
| # or
$ # end of string.
)
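To turn the matches into the string array the question asks for, something along these lines could be used. It is a sketch: CodeExtractor and Extract are hypothetical names, and the LINQ calls assume a using directive for System.Linq.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class CodeExtractor
{
    // Collects every "capital letter + digit" token that is delimited
    // by whitespace or by the start/end of the string.
    public static string[] Extract(string text)
    {
        return Regex.Matches(text, @"(?<=^|\s)[A-Z][0-9](?=\s|$)")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }
}
```

For the example input "AA2410 F7 A4 Y7 B7 A 0715 0836 E0.M80" this yields F7, A4, Y7 and B7. E0 is skipped because it is followed by a period rather than whitespace; with plain \b boundaries it would have been included.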
If you also need to find non-ASCII letters/digits, you can use
\p{Lu}\p{N}
instead of [A-Z][0-9]. This finds all uppercase Unicode letters and Unicode digits (like Ä٣), but I guess that's not really what you're after, is it?
Do you mean that each code looks like "A00"?
Then this is the regex:
"[A-Z][0-9][0-9]"
Very simple... By the way, there's no point writing {1} in a regex. [0-9]{1} means "match exactly one digit", which is exactly like writing [0-9].
Don't give up, simple regexes make perfect sense.
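This should be OK:
String[] all_codes = Regex.Matches(originalText, @"\b[A-Z]\d\b").Cast<Match>().Select(m => m.Value).ToArray();
It gives you an array with every code that starts with a capital letter followed by a digit and is delimited by any kind of word boundary (whitespace etc.). Note that Regex.Split with the same pattern would instead return the text between the codes. The Cast and Select calls require using System.Linq.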