Regex group matching without word boundaries - c#

I am trying to create a function that returns true when the string doesn't have a particular group of chars (in this example the group is "DontMatchMe")
so, of the following examples:
example1
examDontMatchMeple2
example3
examDontMatchMeple4
example4
valid matches are:
example1
example3
example4
my first option was to use the pattern .*(?!DontMatchMe).* but .* is consuming everything, the match is always true.
Note that the values on the string I am actually using are random. I cannot use "exe" to build the regex, for example. the "DontMatchMe" is also random.

In order to exclude a specific word, you can use a pattern like this: ^(?!.*DontMatchMe).+
To avoid the issue with .* consuming everything you can anchor the pattern to the beginning of the string. The pattern break-down is as follows:
^: anchor to the beginning of the string
(?!.*DontMatchMe): negative look-ahead that fails the match if the text to be ignored appears anywhere in the string
.+: finally, match one or more characters (which would happen as long as the look-ahead didn't match anything)
Example:
string[] inputs =
{
"example1",
"examDontMatchMeple2",
"example3",
"examDontMatchMeple4",
"example4"
};
string ignoreText = "DontMatchMe";
string pattern = String.Format("^(?!.*{0}).+", Regex.Escape(ignoreText));
foreach (var input in inputs)
{
Console.WriteLine("{0}: {1}", input, Regex.IsMatch(input, pattern));
}

If it's a simple non-regex string that you want to check for, you can simply use the Contains method and invert the result:
bool doesNotContain(string s, string group) {
// error check for nulls first (not included here)
return !s.Contains(group);
}
If you want your group to possibly be a regex, you can still use the same principle. Look for the pattern you don't want, and if it's there return false, otherwise return true. This is probably easier to read and understand, particularly for people not familiar with the more advanced concepts of regular expressions, like negative lookaheads.
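The same inversion can be sketched for the case where the unwanted group is itself a regular expression. This is a minimal sketch, not part of the original answer: the helper name DoesNotMatch and the decision to throw on null inputs are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupFilter
{
    // Hypothetical helper: true when 'unwantedPattern' matches nowhere in 's'.
    public static bool DoesNotMatch(string s, string unwantedPattern)
    {
        if (s == null) throw new ArgumentNullException(nameof(s));
        if (unwantedPattern == null) throw new ArgumentNullException(nameof(unwantedPattern));

        // Look for the pattern we do NOT want and invert the result.
        return !Regex.IsMatch(s, unwantedPattern);
    }

    public static void Main()
    {
        Console.WriteLine(DoesNotMatch("example1", "DontMatchMe"));            // True
        Console.WriteLine(DoesNotMatch("examDontMatchMeple2", "DontMatchMe")); // False
        Console.WriteLine(DoesNotMatch("exam42ple", @"\d+"));                  // False
    }
}
```

If the group is random literal text rather than a pattern, it would need to go through Regex.Escape first, as in the other answer.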

How to match any repeated chunks of characters?

I've seen many questions similar to this but none quite like it.
I have strings like this:
HF-01-HF-01-01
FBC-FBC-04
OZYA-03A-OZYA-03A-03
QC-QC-02
and want them to be returned like so:
HF-01-01
FBC-04
OZYA-03A-03
QC-02
I can't figure this out and the other questions I've seen don't apply because 1) the repeated chunk is more than one character, 2) There are no spaces between the repetition.
Or is regex not the best way to do this?
EDIT:
Rules
Alpha chunks are never repeated more than one time.
Some chunks can be alphanumeric but also never repeated more than one
time.
The part that can be repeated would be from the start of the string
and any additional chunks by hyphen.
So you would never have something like HF-HF-01-01. But in this case using the above rules, it would become HF-01-01 since HF is the only part repeated from the beginning of the string.
Perhaps something like this would work:
Scan string to first hyphen, see if that matches anywhere else after first hyphen, if so scan to second hyphen, see if that matches anywhere else, if not, take the first scan and remove one instance of it from the string, if so, scan to third, etc.
But I don't know how to do that in regex.
I'm not sure if RegExp is the right tool here.
Using MoreLinq RunLengthEncode method (that implement R.L.E.) you can achieve it like this:
string RemoveDuplicate(string input)
{
var chunks = input.Split('-') // cut at -
.RunLengthEncode() // group and count adjacent equal chunks
.Select(kvp => kvp.Key);// just take the chunk value
return string.Join("-", chunks); // reglue with -
}
Edit
Doesn't work for:
OZYA-03A-OZYA-03A-03
I guess,
([^-\r\n]+-|[^-\r\n]+-[^-\r\n]+-)(\1.*)
or with start/end anchors,
^([^-\r\n]+-|[^-\r\n]+-[^-\r\n]+-)(\1.*)$
might work to some extent and the desired output is in the last capturing group:
(\1.*)
Test
using System;
using System.Text.RegularExpressions;
public class Example
{
public static void Main()
{
string pattern = @"([^-\r\n]+-|[^-\r\n]+-[^-\r\n]+-)(\1.*)";
string input = @"HF-01-HF-01-01
FBC-FBC-04
OZYA-03A-OZYA-03A-03
QC-QC-02
and want them to be returned like so:
HF-01-01
FBC-04
OZYA-03A-03
QC-02";
RegexOptions options = RegexOptions.Multiline;
foreach (Match m in Regex.Matches(input, pattern, options))
{
Console.WriteLine("'{0}' found at index {1}.", m.Value, m.Index);
}
}
}
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.
I'm not sure if regex is the right tool here, but at least it can be somewhat done with this short pattern:
^([A-Z0-9]+)-.*(\1.*)$
Explanation:
^ start of string
( group 1 start
[A-Z0-9]+ one or more capital letters or digits
) end group 1
- literal
.* any number of any chars
( group 2 start
\1 anything that was matched in group 1
.* any number of any chars
) end group 2 (this group will be used as the result)
$ end of string
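As a rough sketch of how the pattern above might be applied in C#: the class and method names here are hypothetical, and returning the input unchanged when nothing matches is an assumption, since the answer does not say what should happen in that case.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RepeatedPrefix
{
    // Pattern from the answer above; group 2 holds the de-duplicated text.
    private static readonly Regex Pattern = new Regex(@"^([A-Z0-9]+)-.*(\1.*)$");

    // Hypothetical helper: returns group 2 on a match, otherwise the input unchanged.
    public static string RemoveRepeatedPrefix(string input)
    {
        Match m = Pattern.Match(input);
        return m.Success ? m.Groups[2].Value : input;
    }

    public static void Main()
    {
        string[] samples = { "HF-01-HF-01-01", "FBC-FBC-04", "OZYA-03A-OZYA-03A-03", "QC-QC-02" };
        foreach (string s in samples)
        {
            Console.WriteLine("{0} -> {1}", s, RemoveRepeatedPrefix(s));
        }
    }
}
```

Note that the back-reference only checks that the text of the first chunk reappears later; it does not verify that the reappearance is a whole hyphen-delimited chunk, so inputs outside the rules given in the question may behave unexpectedly.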

How to use regex to match anything from A to B, where B is not preceded by C

I'm having a hard time with this one. First off, here is the difficult part of the string I'm matching against:
"a \"b\" c"
What I want to extract from this is the following:
a \"b\" c
Of course, this is just a substring from a larger string, but everything else works as expected. The problem is making the regex ignore the quotes that are escaped with a backslash.
I've looked into various ways of doing it, but nothing has gotten me the correct results. My most recent attempt looks like this:
"((\"|[^"])+?)"
In various tests online, this works the way it should - but when I build my ASP.NET page, it cuts off at the first ", leaving me with just the letter a, a white space and a backslash.
The logic behind the pattern above is to capture all instances of \" or something that is not ". I was hoping this would search for \", making sure to find those first - but I got the feeling that this is overridden by the second part of the expression, which is only 1 single character. A single backslash does not match 2 characters (\"), but it will match as a non-". And from there, the next character will be a single ", and the matching is completed. (This is just my hypothesis on why my pattern is failing.)
Any pointers on this one? I have tried various combinations with "look"-methods in regex, but I didn't really get anywhere. I also get the feeling that is what I need.
ORIGINAL ANSWER
To match a string like a \"b\" c, you need to use following regex declaration:
(?:\\"|[^"])+
var rx = new Regex(@"(?:\\""|[^""])+");
See RegexStorm demo
Here is an IDEONE demo:
var str = "a \\\"b\\\" c";
Console.WriteLine(str);
var rx = new Regex(@"(?:\\""|[^""])+");
Console.WriteLine(rx.Match(str).Value);
Please note the @ in front of the string literal, which makes it a verbatim string literal: we have to double the quotes to match literal quotes, but we can use single escape backslashes instead of double ones. This makes regexps easier to read and maintain.
If you want to match any escaped entities in your input string, you can use:
var rx = new Regex(@"[^""\\]*(?:\\.[^""\\]*)*");
See demo on RegexStorm
UPDATE
To match the quoted strings, just add quotes around the pattern:
var rx = new Regex(@"""(?<res>[^""\\]*(?:\\.[^""\\]*)*)""");
This pattern yields much better performance than Tim Long's suggested regex, according to a RegexHero test.
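A minimal sketch of how the quoted-string pattern above might be used to pull out the named group res. The class name, the helper ExtractFirst, the sample input, and the choice to return null when nothing matches are all assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuotedString
{
    // Pattern from the answer above: an opening quote, runs of ordinary characters
    // interleaved with backslash escapes, then the closing quote.
    private static readonly Regex Quoted =
        new Regex(@"""(?<res>[^""\\]*(?:\\.[^""\\]*)*)""");

    // Hypothetical helper: content of the first quoted string, or null if none is found.
    public static string ExtractFirst(string input)
    {
        Match m = Quoted.Match(input);
        return m.Success ? m.Groups["res"].Value : null;
    }

    public static void Main()
    {
        string input = "prefix \"a \\\"b\\\" c\" suffix";
        Console.WriteLine(input);               // prefix "a \"b\" c" suffix
        Console.WriteLine(ExtractFirst(input)); // a \"b\" c
    }
}
```

Because the character classes exclude both the quote and the backslash, a backslash can only be consumed together with the character that follows it, which is what keeps an escaped quote from ending the match.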
The following expression worked for me:
"(?<Result>(\\"|.)*)"
The expression matches as follows:
An opening quote (literal ")
A named capture (?<name>pattern) consisting of:
Zero or more occurrences * of literal \" or (|) any single character (.)
A final closing quote (literal ")
Note that the * (zero or more) quantifier is greedy, so it first consumes the final quote through the "any single character" . part; the engine then backtracks one step so that the final quote is matched by the closing literal ".
I used ReSharper 9's built-in Regular Expression validator to develop the expression and verify the results:
I have used the "Explicit Capture" option to reduce cruft in the output (RegexOptions.ExplicitCapture).
One thing to note is that I am matching the whole string, but I am only capturing the substring, using a named capture. Using named captures is a really useful way to get at the results you want. In code, it might look something like this:
static string MatchQuotedString(string input)
{
const string pattern = @"""(?<Result>(\\""|.)*)""";
const RegexOptions options = RegexOptions.ExplicitCapture;
Regex regex = new Regex(pattern, options);
var matches = regex.Match(input);
var substring = matches.Groups["Result"].Value;
return substring;
}
Optimization: If you are planning on using the regex a lot, you could factor it out into a field and use the RegexOptions.Compiled option, this pre-compiles the expression and gives you faster throughput at the expense of longer initialization.
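The optimization described above could look roughly like the following sketch. The class name and field name are assumptions; the pattern and options are the ones from the answer, with RegexOptions.Compiled added.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuotedStringMatcher
{
    // Built once and reused on every call; RegexOptions.Compiled trades a slower
    // first construction for faster matching afterwards.
    private static readonly Regex QuotedRegex = new Regex(
        @"""(?<Result>(\\""|.)*)""",
        RegexOptions.ExplicitCapture | RegexOptions.Compiled);

    public static string MatchQuotedString(string input)
    {
        Match match = QuotedRegex.Match(input);
        // Groups["Result"].Value is an empty string when nothing matched.
        return match.Groups["Result"].Value;
    }

    public static void Main()
    {
        Console.WriteLine(MatchQuotedString("\"a \\\"b\\\" c\"")); // a \"b\" c
    }
}
```

Whether compiling pays off depends on how often the expression runs; for a handful of calls the default interpreted mode is likely sufficient.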

Regex replace all matching words that do not contain a certain string

How can I use regex to replace matching strings that do not include a specific string?
input string
Keepword mywordsecond mythirdword myfourthwordKeep
string to replace
word
exclude string
Keep
Desired out put
Keepword mysecond mythird myfourthKeep
Will there ever be more than one word in a word? If there are more than one, do you want to replace all of them? If not, this should sort you out:
Regex r = new Regex(@"\b((?:(?!Keep|word)\w)*)word((?:(?!Keep)\w)*)\b");
s1 = r.Replace(s0, "$1$2");
to explain:
First, \b((?:(?!Keep|word)\w)*) captures whatever text precedes the first occurrence of word or Keep.
The next thing it sees must be word, If it sees Keep or the end of the string instead, the match attempt immediately fails.
Then ((?:(?!Keep)\w)*)\b captures the remainder of the text in order to ensure it doesn't contain Keep.
When faced with a problem like this, most users' first impulse is to match (in the sense of consuming) only the part of the string they're interested in, using lookarounds to establish the context. It's usually much easier to write the regex so that it always moves forward through the string as it matches. You capture the parts you want to retain so you can plug them back into the result string by means of group references ($1, $2, etc.).
Given that you're using C#, you could use the lookaround approach:
Regex r = new Regex(@"(?<!Keep\w*)word(?!\w*Keep)");
s1 = r.Replace(s0, "");
But please don't. There are very few regex flavors that support unrestricted lookbehinds like .NET does, and most problems don't work so neatly as this one anyway.
string str = "Keepword mywordsecond mythirdword myfourthwordKeep";
str = Regex.Replace(str, "(?<!Keep)word", "");
And I'm going to link you to a good Regular Expressions cheat sheet here
This works in notepad++:
(?<!Keep)word(?!Keep)
It uses a negative look-behind together with a negative look-ahead.
You can use a negative look-behind assertion if you want to remove every "word" that is not preceded by "Keep":
String input = "Keepword mywordsecond mythirdword myfourthwordKeep";
String pattern = "(?<!Keep)word";
String output = Regex.Replace(input, pattern, "");
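The answers in this thread do not all treat the last word of the sample the same way. The sketch below runs three of the suggested patterns side by side on the question's input; the class and method names are hypothetical, while the patterns are copied from the answers.

```csharp
using System;
using System.Text.RegularExpressions;

public static class KeepWordDemo
{
    public const string Input = "Keepword mywordsecond mythirdword myfourthwordKeep";

    // Capture-and-rebuild: leaves alone any word that contains "Keep" anywhere.
    public static string CaptureApproach(string s) =>
        Regex.Replace(s, @"\b((?:(?!Keep|word)\w)*)word((?:(?!Keep)\w)*)\b", "$1$2");

    // Lookarounds on both sides (relies on .NET's variable-length lookbehind).
    public static string LookaroundApproach(string s) =>
        Regex.Replace(s, @"(?<!Keep\w*)word(?!\w*Keep)", "");

    // Lookbehind only: skips "word" only when "Keep" comes immediately before it.
    public static string LookbehindOnly(string s) =>
        Regex.Replace(s, @"(?<!Keep)word", "");

    public static void Main()
    {
        Console.WriteLine(CaptureApproach(Input));    // Keepword mysecond mythird myfourthwordKeep
        Console.WriteLine(LookaroundApproach(Input)); // Keepword mysecond mythird myfourthwordKeep
        Console.WriteLine(LookbehindOnly(Input));     // Keepword mysecond mythird myfourthKeep
    }
}
```

Only the lookbehind-only pattern reproduces the desired output exactly as written in the question, because the other two also protect myfourthwordKeep; which behavior is right depends on whether "Keep" after the word should count as an exclusion.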

Regular expression only one character and 7 numbers

i want regex match only one char in any position of word and 7 numbers
match example:
1111111q
2222222q
111e1111
11e11111
i do this pattern but not working in all patterns:
[A-Za-z][0-9]{7}
Regular expressions match patterns. In your case, it would seem that the letter can be at any point in your string, which would mean that you would have a multitude of patterns which would need to be taken into consideration.
I think that for this case, you should not use regular expressions, for simplicity's sake. I would recommend you take a look at the Char.IsDigit(Char c) and Char.IsLetter(Char c) methods and use counters to check that the string is in the format you are after.
There are readily available methods in C# for checking the conditions you want. I would use Regex only if there is no parser or simple C# solution.
I would do like below
var str = "1111111u";
var isValid = str.Length ==8 &&
str.Where(char.IsDigit).Count() ==7 &&
str.Where(char.IsLetter).Count() ==1;
It is not that difficult in regex:
If the complete string has to match just use:
^(?=.{8}$)\d*[a-zA-Z]\d*$
See it here on regexr.
If this is a word in a larger text use:
\b(?=[a-z0-9]{8}\b)\d*[a-z]\d*\b
See it here on Regexr
\d*[a-z]\d* matches any amount of digits, followed by one letter, then again any amount of digits.
(?=[a-z0-9]{8}\b) is a positive lookahead assertion; it ensures a total length of 8.
Important here is the use of anchors or word boundaries to avoid partial wrong matches.
If you really want to match any letter then use the Unicode property \p{L} instead of the character class:
^(?=.{8}$)\d*\p{L}\d*$
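A small sketch of the whole-string variant in C#, checked against the question's examples plus a few invalid inputs. The class name, the helper IsValid, and the extra test strings are assumptions added for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OneLetterSevenDigits
{
    // The lookahead fixes the total length at 8; the rest allows exactly one ASCII letter.
    private static readonly Regex Pattern = new Regex(@"^(?=.{8}$)\d*[a-zA-Z]\d*$");

    // Hypothetical helper wrapping the pattern.
    public static bool IsValid(string s) => Pattern.IsMatch(s);

    public static void Main()
    {
        string[] samples = { "1111111q", "111e1111", "11e11111", "11111111", "1a1b1111", "111111q" };
        foreach (string s in samples)
        {
            // The first three print True; the last three print False.
            Console.WriteLine("{0}: {1}", s, IsValid(s));
        }
    }
}
```

One caveat: in .NET, \d also matches non-ASCII Unicode digits by default, so [0-9] or RegexOptions.ECMAScript may be preferable if only ASCII digits should be accepted.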
I can only come up with a "brute force" regex method:
foundMatch = Regex.IsMatch(subjectString,
@"\b
(?:[a-z]\d{7}|
\d[a-z]\d{6}|
\d{2}[a-z]\d{5}|
\d{3}[a-z]\d{4}|
\d{4}[a-z]\d{3}|
\d{5}[a-z]\d{2}|
\d{6}[a-z]\d{1}|
\d{7}[a-z])
\b",
RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);
Note the word boundary anchors, which you should remove if this pattern is part of a longer string.
Also note the IgnoreCase option, which you can remove if all letters will be lower case.
Edit: See @stema's answer for a much more concise regex
This will match what you want:
(\d{1}\w\d{6}|\d{2}\w\d{5}|\d{3}\w\d{4}|\d{4}\w\d{3}|\d{5}\w\d{2}|\d{6}\w\d{1}|\d{7}\w)
I generated it like this, in powershell:
$n = 6;
for ($i = 1; $i -le 6; $i++) {
write-host "\d{"$i"}\w\d{"$n"}"
$n--
}
Your example will only work when the character is the first character in the string.
The problem you've got is that you need a total of 7 digits, and absolutely only one character potentially within those 7 digits. This is not something that's possible with regular expressions as defined in theory, because you have to have a link between the two groups of digits to see how many are in the other group and regexes can't carry that kind of context around with them.
I was wondering if it was possible using a lookahead assertion to ensure there's only one letter, but the best I can do is ensuring there's no instance of two letters in a row, which doesn't cover all possible invalid cases. Thus I think you're going to have to find another method, as npinti suggested. So something like:
public static bool Match(string s) {
return (s.Length == 8) &&
(s.Where(Char.IsDigit).Count() == 7) &&
(s.Where(Char.IsLetter).Count() == 1);
}
But I haven't tested that.
just use this if you want one letter and 7 digits
"[A-Za-z]{1}[0-9]{7}|[0-9]{1}[A-Za-z]{1}[0-9]{6}|[0-9]{2}[A-Za-z]{1}[0-9]{5}|[0-9]{3}[A-Za-z]{1}[0-9]{4}|[0-9]{4}[A-Za-z]{1}[0-9]{3}|[0-9]{5}[A-Za-z]{1}[0-9]{2}|[0-9]{6}[A-Za-z]{1}[0-9]{1}|[0-9]{7}[A-Za-z]{1}"
and here is a code snippet showing how you can iterate through your results
string st = "1111111q 2222222q 111e1111 11e11111";
string pattS = @"[A-Za-z]{1}[0-9]{7}|[0-9]{1}[A-Za-z]{1}[0-9]{6}|[0-9]{2}[A-Za-z]{1}[0-9]{5}|[0-9]{3}[A-Za-z]{1}[0-9]{4}|[0-9]{4}[A-Za-z]{1}[0-9]{3}|[0-9]{5}[A-Za-z]{1}[0-9]{2}|[0-9]{6}[A-Za-z]{1}[0-9]{1}|[0-9]{7}[A-Za-z]{1}";
Regex regex = new Regex(pattS);
var res = regex.Matches(st);
foreach (Match re in res)
{
Console.WriteLine(re.Value); // print each 8-character match
}
check here on rubular it covers all examples you provide
You can use this pattern:
^([0-9])(?:\1|[a-z](?!.*[a-z])){7}|[a-z]([0-9])\2{6}$
With Regex, you can do it in two steps. First you can remove the character, in whatever position it is:
string input = "111a1111";
Regex rgx = new Regex(@"[a-zA-Z]");
string output=rgx.Replace(input,"",1); // remove only one character
// output = "1111111"
then you can match with [0-9]{7} (if you don't want all digits to be the same)
or with ^(\d)\1{6}$ (if you want 7 occurrences of the same digit)

Unexpected regular expression groups

I want to use regular expressions for analyzing a url, but I can't get the regex groups as I would expect them to be.
My regular expression is:
@"member/filter(.*)(/.+)*"
The strings to match:
"member/filter-one"
"member/filter-two/option"
"member/filter-three/option/option"
I expect to get the following groups:
member/filter-one, /filter-one
member/filter-two/option, /filter-two, /option
member/filter-three/option/option, /filter-three, /option(with 2 captures)
I get the result for the first string, but for the 2 others I get:
member/filter-two/option, /filter-two/option, empty string
member/filter-three/option/option, /filter-three/option/option, empty string
What can be the issue?
Try
@"member/filter([^/]*)(/.+)*"
Another way could be to use the MatchCollection this way:
string url = "member/filter-three/option/option";
url = url.Replace("member/filter-", string.Empty); // cutting static content
MatchCollection matches = new Regex(@"([^/]+)/?").Matches(url);
foreach (Match match in matches)
{
Console.WriteLine(match.Groups[1].Value);
}
Console.ReadLine();
Here, you first remove the constant part from your string (it could be a parameter of a function). Then you simply match everything between two / characters. You do that with [^/], which matches one character that is not a /, followed by a quantifier (the + sign), which means match one or more such characters.
The "member/filter([^/]*)(/.+)*" pattern seems logical but is impractical, as it accepts empty options (e.g. member/filter1/////////). A more accurate and practical pattern, which also allows you to accept more than one filter with options, is member(/filter[^/]+(/[^/]+)*)*
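To get one capture per option, as the question expected, the repeated group has to stop at each slash. The sketch below reads the individual captures of group 2 from the last pattern above; the ^ and $ anchors, the class name, and the helper GetOptions are assumptions added for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlOptions
{
    // Pattern from the paragraph above, anchored here (an assumption) to match whole URLs.
    private static readonly Regex Pattern =
        new Regex(@"^member(/filter[^/]+(/[^/]+)*)*$");

    // Hypothetical helper: returns every "/option" segment captured by group 2.
    public static string[] GetOptions(string url)
    {
        Match m = Pattern.Match(url);
        if (!m.Success) return new string[0];

        // A repeated group keeps every iteration in its Captures collection.
        CaptureCollection captures = m.Groups[2].Captures;
        string[] options = new string[captures.Count];
        for (int i = 0; i < captures.Count; i++)
        {
            options[i] = captures[i].Value;
        }
        return options;
    }

    public static void Main()
    {
        string[] urls = { "member/filter-one", "member/filter-two/option", "member/filter-three/option/option" };
        foreach (string url in urls)
        {
            Console.WriteLine("{0}: [{1}]", url, string.Join(", ", GetOptions(url)));
        }
    }
}
```

In the original pattern, (/.+)* produced a single capture at most because .+ is greedy and also consumes slashes; restricting the segment to [^/]+ is what lets the group repeat once per option.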
