Capture data from string not containing duplicate group of characters and strings - c#

I am trying to verify and extract data coming from an API. I need to extract the text between [] brackets, which can appear anywhere in the data, e.g.:
This is [extract] message
This is message [extract]
[extract] this message
The regular expression I was using for this, shown below, was working fine:
^[^\]\[]*?\[(?<description>[^\]\[]+)\][^\]\[]*?$
Now the data from the API can be URL-encoded (percent-encoded) and contain %5B instead of [ and %5D instead of ].
I updated the regular expression as follows:
^[^\]\[%5B%5D]*?(\[|%5B)(?<description>[^\]\[%5B%5D]+)(\]|%5D)[^\]\[%5B%5D]*?$/i
But it does not treat %5B and %5D as single atoms, so it cannot extract the text from the following valid data:
This is [extract] message %
This is message 5 [extract]
[extract d] this message
And it does extract text from the following invalid data:
[extract %5D this message
%5B extract ] this message
How can I treat %5B and %5D as atoms and correct the above regex?

First of all, your first regex should be written as
^[^][]*\[(?<description>[^][]+)][^][]*$
Note that there is no point in escaping [ inside a character class, and there is no need to escape ] inside a character class if it is the first char there, nor the ] outside the character class. Also, there is no need for the lazy quantifier *? here; a plain * works equally well.
Now, you should decode the string to plain text and then run the above regex. If you do not want to do that, you will have to use a more complex regex based on a tempered greedy token, like
^(?:(?!%5[DB])[^][])*(?:%5B|\[)(?<description>(?:(?!%5[DB])[^][])+)(?:]|%5D)(?:(?!%5[DB])[^][])*$
See the regex demo (additional patterns are added since it is a multiline demo).
Regex explanation:
^ - string start
(?:(?!%5[DB])[^][])* - a tempered greedy token matching any 0+ symbols other than ] and [ (see [^][]) that is not the starting char for a %5B or %5D char sequence
(?:%5B|\[) - the leading delimiter, a %5B or [
(?<description>(?:(?!%5[DB])[^][])+) - the "description" group matching 1+ symbols other than ] and [ that are not the starting char of a %5B or %5D char sequence (NOTE: you might want to replace it with the (?<description>(?s:.+?)) subpattern to check if that works better for you).
(?:]|%5D) - trailing delimiter, ] or %5D
(?:(?!%5[DB])[^][])* - see above (2nd line)
$ - end of string.
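
To make the two options above concrete, here is a minimal C# sketch. The helper names TryExtractDecoded and TryExtractRaw are hypothetical, and RegexOptions.IgnoreCase is an assumption added so that lowercase %5b / %5d are accepted too; Uri.UnescapeDataString is used for the decoding step.

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern for plain text: exactly one [...] pair and no other brackets.
var plain = new Regex(@"^[^][]*\[(?<description>[^][]+)][^][]*$");

// Tempered greedy token pattern for raw input, where %5B / %5D may stand in for [ / ].
// RegexOptions.IgnoreCase is an assumption, so that %5b / %5d are accepted as well.
var tempered = new Regex(
    @"^(?:(?!%5[DB])[^][])*(?:%5B|\[)(?<description>(?:(?!%5[DB])[^][])+)(?:]|%5D)(?:(?!%5[DB])[^][])*$",
    RegexOptions.IgnoreCase);

// Option 1 (hypothetical helper): percent-decode first, then apply the simple pattern.
bool TryExtractDecoded(string input, out string description)
{
    Match m = plain.Match(Uri.UnescapeDataString(input));
    description = m.Success ? m.Groups["description"].Value : "";
    return m.Success;
}

// Option 2 (hypothetical helper): match the raw, still-encoded input.
bool TryExtractRaw(string input, out string description)
{
    Match m = tempered.Match(input);
    description = m.Success ? m.Groups["description"].Value : "";
    return m.Success;
}

string[] inputs =
{
    "This is [extract] message",
    "This is %5Bextract%5D message",
    "[one] and [two]",
};

foreach (string s in inputs)
{
    bool okDecoded = TryExtractDecoded(s, out string fromDecoded);
    bool okRaw = TryExtractRaw(s, out string fromRaw);
    Console.WriteLine($"{s} -> decoded: {okDecoded} '{fromDecoded}', raw: {okRaw} '{fromRaw}'");
}
```

Note that with the second option the captured description is still percent-encoded, so any other escapes inside it (for example %20) would need decoding afterwards.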

Related

Using Regex, find any text inside square brackets, ignoring the ones with a preceding backslash ("\")

I'm trying to find a regular expression that will match all groups of text inside square brackets, except the ones with a preceding backslash. To illustrate my point, given this text:
[#5C9269]\[Go to [size=120%]1[/size] now][/color]
Only the groups #5C9269, size=120%, /size and /color should match, ignoring the group Go to [size=120%]1[/size] now with the preceding backslash character.
My best attempt at this problem was the regex /([^\\])\[([^{}]*?)\]/, which looks for the absence of a preceding backslash and captures the target text in two groups. However, this expression fails to capture valid matches at the start of the line, as in my previous example, because there is no character before them.
In your pattern, ([^\\]) has to consume a character before the opening bracket, so a bracket at the very start of the string can never match.
You can exclude a preceding backslash using a negative lookbehind, which does not consume anything. Then, in the character class, exclude [ and ] instead of { and }.
You can also remove the non-greedy quantifier ? from [^{}]*?, as the negated character class cannot match past the closing bracket anyway.
The values that you want are in capture group 1.
(?<!\\)\[([^][]+)]
Explanation
(?<!\\) Negative lookbehind, assert that there is no \ directly to the left of the current position
\[ Match the opening [
([^][]+) Capture group 1, match 1+ occurrences of any char except [ and ]
] Match the closing ]
See a regex demo.
If you also want to match empty strings between the square brackets, you can use [^][]*
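
As a rough illustration of the lookbehind pattern, here it is applied to the sample text from the question. C# is an assumption here (the question does not name a language); the pattern itself is taken unchanged from the answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string input = @"[#5C9269]\[Go to [size=120%]1[/size] now][/color]";

// (?<!\\) rejects an opening bracket that is directly preceded by a backslash;
// capture group 1 holds the text between the brackets.
var tags = Regex.Matches(input, @"(?<!\\)\[([^][]+)]")
    .Cast<Match>()
    .Select(m => m.Groups[1].Value)
    .ToList();

Console.WriteLine(string.Join(", ", tags));
// prints: #5C9269, size=120%, /size, /color
```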

Remove some specific string with special character

Input String
string b = "14-03-002980 AND 14-03- [ ] (5)Description of 002981";
In the output string I want this result:
14-03-002980 AND 14-03-002981
I tried the regex below, but it does not work:
Regex.Replace(b, "[#&'(\\s)<>(5)Description of ]","");
Please help me if anyone knows how to do this.
You can use this regex,
\s+\[.*(?=\b\d+)
and replace it with empty string.
The pattern starts with one or more whitespace characters (\s+), then matches a literal [ using \[. After that, .* consumes characters greedily and backtracks until the positive lookahead (?=\b\d+) succeeds, so the match ends right before the last number in the string.
Regex Demo
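
A short C# sketch of that replacement, using the input string from the question:

```csharp
using System;
using System.Text.RegularExpressions;

string b = "14-03-002980 AND 14-03- [ ] (5)Description of 002981";

// Remove everything from the whitespace before '[' up to (not including) the trailing number.
string result = Regex.Replace(b, @"\s+\[.*(?=\b\d+)", "");

Console.WriteLine(result);
// prints: 14-03-002980 AND 14-03-002981
```

Because .* is greedy, the lookahead is first satisfied at the last number in the string, which is why the (5) inside the removed part does not stop the match early.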

Regular expression in RegularExpressionAttribute behavior

I am using this regular expression: @"[ \]\[;\/\\\?:*""<>|+=]|^[.]|[.]$"
First part [ \]\[;\/\\\?:*""<>|+=] should match any of the characters inside the brackets.
Next part ^[.] should match if the string starts with a 'dot'
Last part [.]$ should match if the string ends with a 'dot'
This works perfectly fine if I use the Regex.IsMatch() function. However, if I use RegularExpressionAttribute in ASP.NET MVC, I always get an invalid model. Does anyone have any clue why this behavior occurs?
Examples:
"abcdefg" should not match
".abcdefg" should match
"abc.defg" should not match
"abcdefg." should match
"abc[defg" should match
Thanks in advance!
EDIT:
The RegularExpressionAttribute specifies that a data field value in ASP.NET Dynamic Data must match the specified regular expression.
Which means I need "abcdef" to match and ".abcdefg" to not match. Basically, I need to negate the whole expression I have above.
You need to make sure the pattern matches the entire string.
In a general case, you may append/prepend the pattern with .*.
Here, you may use
.*[ \][;/\\?:*"<>|+=].*|^[.].*|.*[.]$
Or, to make it a bit more efficient (that is, to reduce backtracking in the first branch), use a negated character class instead of the leading .*:
[^ \][;/\\?:*"<>|+=]*[ \][;\/\\?:*"<>|+=].*|^[.].*|.*[.]$
But it is best to put the branches matching text at the start/end of the string as first branches:
^[.].*|.*[.]$|[^ \][;/\\?:*"<>|+=]*[ \][;/\\?:*"<>|+=].*
NOTE: You do not have to escape / and ? chars inside the .NET regex since you can't use regex delimiters there.
C# declaration of the last pattern will look like
@"^[.].*|.*[.]$|[^ \][;/\\?:*""<>|+=]*[ \][;/\\?:*""<>|+=].*"
See this .NET regex demo.
RegularExpressionAttribute:
[RegularExpression(
@"^[.].*|.*[.]$|[^ \][;/\\?:*""<>|+=]*[ \][;/\\?:*""<>|+=].*",
ErrorMessage = "Username cannot contain following characters: ] [ ; / \\ ? : * \" < > | + =")
]
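
The difference between Regex.IsMatch and a whole-string match can be sketched as follows. MatchesWholeValue is a hypothetical helper, and the assumption is that the attribute accepts a value only when the first match starts at index 0 and spans the entire value; the two patterns are the ones from the question and from this answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: true only if the first match covers the entire value
// (assumed to mirror what the attribute checks).
bool MatchesWholeValue(string pattern, string value)
{
    Match m = Regex.Match(value, pattern);
    return m.Success && m.Index == 0 && m.Length == value.Length;
}

// Pattern from the question: matches a single offending character somewhere.
string original = @"[ \]\[;\/\\\?:*""<>|+=]|^[.]|[.]$";
// Pattern from this answer: each branch is padded so it consumes the whole string.
string padded = @"^[.].*|.*[.]$|[^ \][;/\\?:*""<>|+=]*[ \][;/\\?:*""<>|+=].*";

foreach (string value in new[] { "abcdefg", ".abcdefg", "abc.defg", "abcdefg.", "abc[defg" })
{
    bool isMatch = Regex.IsMatch(value, original);
    bool wholeOriginal = MatchesWholeValue(original, value);
    bool wholePadded = MatchesWholeValue(padded, value);
    Console.WriteLine($"{value}: IsMatch={isMatch}, whole(original)={wholeOriginal}, whole(padded)={wholePadded}");
}
```

For "abc[defg", for instance, the original pattern finds only the single character [ at index 3, so IsMatch succeeds while the whole-value check fails; the padded pattern covers all eight characters.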
Your regex is an alternation of three branches, each matching a single character: one character from the first character class, a dot at the start of the string, or a dot at the end of the string.
It works fine with Regex.IsMatch because one of the alternatives matches somewhere in the string, but it does not match the whole string, which is what you need here.
You could use three alternatives instead. The first matches a dot followed by the character class repeated until the end of the string; the second is the other way around, with the dot at the end of the string.
The third uses a positive lookahead asserting that the string contains at least one of the characters [\][;\/\\?:*"<>|+=]
^\.[a-z \][;\/\\?:*"<>|+=]+$|^[a-z \][;\/\\?:*"<>|+=]+\.$|^(?=.*[\][;\/\\?:*"<>|+=])[a-z \][;\/\\?:*"<>|+=]+$
Regex demo

How to check for nested square brackets?

I have the input string below:
[text1][text2][text3]...[textN]
and I want to apply the following validation rule using regular expression:
The ] and [ cannot be included in other [].
For example, the next input strings are not correct:
[test1][test2[][test3]
[test1][test2]][test3]
[test1][test2[lol][test3]
[test1][test2]lol][test3]
I need to validate the input string because I am going to split it on [] groups (again using regular expression).
If you really want a regexp, here is a quick one:
^(\[[^\[\]]+\])*$
Works on your examples
The principle is that each bracket pair \[...\] may only contain text without brackets, [^\[\]]+, and that such pairs may repeat any number of times, (...)*.
If you need [test1][test2][][test3] to be accepted, replace the + with a * so that an empty pair matches as well.
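
A minimal C# sketch of validating first and splitting afterwards, run on the samples from the question. C# and the second pattern used for the split are assumptions; the validation pattern is the one given above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Validation pattern from the answer: only complete [...] groups without nested brackets.
var validator = new Regex(@"^(\[[^\[\]]+\])*$");
// Assumed pattern for the later split step: one group at a time, content in capture group 1.
var group = new Regex(@"\[([^\[\]]+)\]");

string[] samples =
{
    "[text1][text2][text3]",
    "[test1][test2[][test3]",
    "[test1][test2]][test3]",
    "[test1][test2[lol][test3]",
    "[test1][test2]lol][test3]",
};

foreach (string s in samples)
{
    if (validator.IsMatch(s))
    {
        // Split only after the whole input has been validated.
        string parts = string.Join(", ", group.Matches(s).Cast<Match>().Select(m => m.Groups[1].Value));
        Console.WriteLine($"{s}: valid, groups = {parts}");
    }
    else
    {
        Console.WriteLine($"{s}: invalid");
    }
}
```

Only the first sample passes. Note that the pattern also accepts the empty string, since the group may repeat zero times.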
This should do the trick:
^(\[\w*\])*$
It means
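^ start of string
\[ a [
\w* zero or more word characters (\w matches [A-Za-z0-9_]; in .NET it also matches other Unicode letters and digits unless RegexOptions.ECMAScript is set)
\] a ]
(...)* the whole group, zero or more times
$ end of string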

Star - look for the character * in a string using regex

I am trying to find the following text in my string : '***'
The thing is that the C# Regex mechanism doesn't allow me to do the following:
new Regex("***", RegexOptions.CultureInvariant | RegexOptions.Compiled);
due to
ArgumentException: "parsing "*" - Quantifier {x,y} following nothing."
Obviously it treats my stars as regex quantifiers.
Is there a way to tell the Regex mechanism to treat stars as just stars and nothing else?
* in Regex means:
Matches the previous element zero or more times.
So you need to use \* or [*] instead.
Explanation:
\
When followed by a character that is not recognized as an escaped character in this and other tables in this topic, matches that character. For example, \* is the same as \x2A.
[ character_group ]
Matches any single character in character_group.
You need to escape the star with a backslash: @"\*"
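
When the text to find is a plain literal, the escaping can also be left to Regex.Escape. A small sketch, reusing the options from the question (the sample text is made up for illustration):

```csharp
using System;
using System.Text.RegularExpressions;

string literal = "***";

// Regex.Escape backslash-escapes every regex metacharacter in the literal.
string pattern = Regex.Escape(literal);
var regex = new Regex(pattern, RegexOptions.CultureInvariant | RegexOptions.Compiled);

Console.WriteLine(pattern);                           // prints: \*\*\*
Console.WriteLine(regex.IsMatch("before *** after")); // prints: True
Console.WriteLine(regex.IsMatch("before ** after"));  // prints: False
```

If no other regex features are needed, string.Contains("***") finds the same literal without involving Regex at all.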
