Regex for string with spaces and special characters - C#

I have been using Regex to match strings embedded in square brackets [*] as:
new Regex(@"\[(?<name>\S+)\]", RegexOptions.IgnoreCase);
I also need to match some codes that look like:
[TESTTABLE: A, B, C, D]
These contain spaces, commas, and a colon.
Could you please show me how to modify the Regex above so that it also matches such codes?
P.S. The other codes have no spaces or special characters, but they are always enclosed in [...].

Regex myregex = new Regex(@"\[([^\]]*)]");
will match any run of characters that are not closing brackets and that are enclosed in brackets. Capture group 1 will contain the content between the brackets.
Explanation (courtesy of RegexBuddy):
Match the character “[” literally «\[»
Match the regular expression below and capture its match into backreference number 1 «([^\]]*)»
Match any character that is NOT a ] character «[^\]]*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the character “]” literally «]»
This will also work if the string you're looking at contains more than one pair of brackets. It will not work if brackets can be nested, e.g. [Blah [Blah] Blah].
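As a minimal sketch of how this could be used in C# (the sample input string and the variable names are made up for illustration):

```csharp
using System;
using System.Text.RegularExpressions;

// Sample input is an assumption for this sketch, not taken from the question.
string input = "[CODE1] some text [TESTTABLE: A, B, C, D]";

// [^\]]* accepts spaces, commas, and colons; it only stops at the closing bracket.
var bracketRegex = new Regex(@"\[([^\]]*)]");

foreach (Match m in bracketRegex.Matches(input))
{
    // Groups[1] holds the text between the brackets.
    Console.WriteLine(m.Groups[1].Value);
}
// Prints:
// CODE1
// TESTTABLE: A, B, C, D
```

Because the class excludes only ], plain codes and codes with spaces or punctuation go through the same pattern.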

/\[([^\]:]*)(?::([^\]]*))?\]/
Capture group 1 will contain the entire tag if it doesn't have a colon, or the part before the colon if it does.
Capture group 2 will contain the part after the colon. You can then split on ',' and trim each entry to get the individual parts.
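Here is a rough C# sketch of the split-and-trim step. It assumes the * sits inside the first group, so that group 1 captures the whole tag name rather than a single character; the variable names are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Group 1: tag name (everything before an optional colon).
// Group 2: the text after the colon, if there is one.
var tagRegex = new Regex(@"\[([^\]:]*)(?::([^\]]*))?\]");

Match tagMatch = tagRegex.Match("[TESTTABLE: A, B, C, D]");
string tagName = tagMatch.Groups[1].Value;

// Split the part after the colon on ',' and trim each entry.
string[] parts = tagMatch.Groups[2].Success
    ? tagMatch.Groups[2].Value.Split(',').Select(p => p.Trim()).ToArray()
    : Array.Empty<string>();

Console.WriteLine(tagName);                 // TESTTABLE
Console.WriteLine(string.Join("|", parts)); // A|B|C|D
```

For a tag without a colon, such as [CODE], group 2 does not participate in the match, so checking Groups[2].Success avoids splitting an empty string.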

Related

Using Regex, find any text inside square brackets, ignoring the ones with a preceding backslash ("\")

I'm trying to find a regular expression that will match all groups of text inside square brackets, except the ones with a preceding backslash. To illustrate my point, given this text:
[#5C9269]\[Go to [size=120%]1[/size] now][/color]
Only the groups #5C9269, size=120%, /size and /color should match; the group Go to [size=120%]1[/size] now, which has a preceding backslash, should be ignored.
My best attempt was the regex /([^\\])\[([^{}]*?)\]/, which checks for the absence of a preceding backslash and captures the result in two groups. However, it fails to capture valid matches at the start of the line, like the first one in my example, because there is no character before them.
Your pattern starts with ([^\\]), which has to consume an actual character, so it cannot match a tag at the very start of the string.
You can exclude a preceding backslash with a negative lookbehind, which is non-consuming. Then, in the character class, exclude [ and ] instead of { and }.
You can also drop the non-greedy quantifier ? from [^{}]*?, because a negated character class that excludes ] can never run past the closing bracket.
The values that you want are in capture group 1.
(?<!\\)\[([^][]+)]
Explanation
(?<!\\) Negative lookbehind, assert no \ directly to the left of the current position
\[ Match the opening [
([^][]+) Capture group 1, match 1+ occurrences of any char except [ and ]
] Match the closing ]
If you also want to match empty strings between the square brackets, you can use [^][]*
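A small C# sketch of the lookbehind approach, run against the text from the question. The brackets inside the character class are written escaped here ([^\]\[]), which should be equivalent to [^][] and is easier to carry between regex flavours; the variable names are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// The sample text from the question, as a verbatim string so the backslash is literal.
string text = @"[#5C9269]\[Go to [size=120%]1[/size] now][/color]";

// (?<!\\) rejects an opening bracket that is directly preceded by a backslash.
var unescapedTag = new Regex(@"(?<!\\)\[([^\]\[]+)\]");

string[] tags = unescapedTag.Matches(text)
    .Cast<Match>()
    .Select(m => m.Groups[1].Value)
    .ToArray();

Console.WriteLine(string.Join(", ", tags));
// Prints: #5C9269, size=120%, /size, /color
```

The escaped group itself is skipped, but the tags nested inside it are still found, which is what the question asks for.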

Match up to the comma - Regex

I have created a Regex Pattern (?<=[TCC|TCC_BHPB]\s\d{3,4})[-_\s]\d{1,2}[,]
This pattern matches only:
TCC 6005_5,
What should I change at the end so that it matches both of these strings:
TCC 6005-5 ,
TCC 6005_5,
You can add a non-greedy wildcard to your expression (.*?):
(?<=(?:TCC|TCC_BHPB)\s\d{3,4})[-_\s]\d{1,2}.*?[,]
                                           ^^^
This will now also match any characters between the last digit and the comma.
As has been pointed out in the comments, [TCC|TCC_BHPB] is a character class rather than a literal match, so I've changed it to (?:TCC|TCC_BHPB), which is presumably what you intended.
This part of the pattern [TCC|TCC_BHPB] is a character class that matches one of the listed characters. It might also be written for example as [|_TCBHP]
To "match" both strings, you can match all parts instead of using a positive lookbehind.
\bTCC(?:_BHPB)?\s\d{3,4}[-_\s]\d{1,2}\s?,
\bTCC A word boundary to prevent a partial match, then match TCC
(?:_BHPB)?\s\d{3,4} Optionally match _BHPB, match a whitespace char and 3-4 digits (Use [0-9] to match a digit 0-9)
[-_\s]\d{1,2} Match one of - _ or a whitespace char, then 1-2 digits
\s?, Match an optional space and ,
Note that \s can also match a newline.
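To compare matching everything with matching only the part after the lookbehind, here is a short C# sketch. The third sample string (without a comma) is an assumption added to show a non-match.

```csharp
using System;
using System.Text.RegularExpressions;

// Matches the whole code, including the TCC prefix.
var fullMatch = new Regex(@"\bTCC(?:_BHPB)?\s\d{3,4}[-_\s]\d{1,2}\s?,");
// Matches only the suffix; the prefix is asserted by the lookbehind but not consumed.
var suffixOnly = new Regex(@"(?<=TCC(?:_BHPB)?\s\d{3,4})[-_\s]\d{1,2}\s?,");

string[] samples = { "TCC 6005-5 ,", "TCC 6005_5,", "TCC 6005_5" };

foreach (string s in samples)
{
    Match full = fullMatch.Match(s);
    Match suffix = suffixOnly.Match(s);
    Console.WriteLine($"{s} => full [{full.Value}], suffix [{suffix.Value}]");
}
```

For the first sample the full pattern returns the entire string, while the lookbehind version returns only "-5 ,". The last sample has no comma, so both values come back empty.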
Using the lookbehind:
(?<=TCC(?:_BHPB)?\s\d{3,4})[-_\s]\d{1,2}\s?,
Or, if you want to match only horizontal whitespace (spaces and tabs, but not newlines):
\bTCC(?:_BHPB)?[\p{Zs}\t][0-9]{3,4}[-_\p{Zs}\t][0-9]{1,2}[\p{Zs}\t]*,
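The difference between \s and [\p{Zs}\t] only shows up when a line break is involved. A quick C# sketch, where the second sample string with an embedded newline is an assumption made up for the comparison:

```csharp
using System;
using System.Text.RegularExpressions;

// \s version: a newline counts as whitespace.
var anyWhitespace = new Regex(@"\bTCC(?:_BHPB)?\s\d{3,4}[-_\s]\d{1,2}\s?,");
// \p{Zs} (Unicode space separators) plus tab: newlines are excluded.
var horizontalOnly = new Regex(@"\bTCC(?:_BHPB)?[\p{Zs}\t][0-9]{3,4}[-_\p{Zs}\t][0-9]{1,2}[\p{Zs}\t]*,");

string sameLine = "TCC 6005-5 ,";
string brokenLine = "TCC 6005-5\n,";

Console.WriteLine(anyWhitespace.IsMatch(brokenLine));   // True
Console.WriteLine(horizontalOnly.IsMatch(brokenLine));  // False
Console.WriteLine(horizontalOnly.IsMatch(sameLine));    // True
```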

Regular Expression that matches on values after a pipe in between brackets

I'm still learning a lot about regex, so please forgive any naivety.
I've been using this site to test:
http://www.systemtextregularexpressions.com/regex.match
Basically, I'm having issues writing a regular expression that will match on any value after a pipe in between brackets.
Given an example string of:
"<div> \n [dont1.dont2|match1|match2] |dont3 [dont4] dont5. \n </div>"
Expected output would be a collection:
match1,
match2
The closest I've been able to get so far is:
(?!\[.*(\|)\])(?:\|)([\w-_.,:']*)
Above gives me the values, including the pipes, and dont3.
I've also tried this guy:
\|(.*(?=\]))
but it outputs:
|match1|match2
Here's one way of doing it:
(?<=\[[^\]]*\|)[^\]|]*
Here's the meaning of the pattern:
(?<=\[[^\]]*\|) - Lookbehind expression to ensure that any match must be preceded by an open bracket, followed by any number of non-close-bracket characters, followed by a pipe character
(?<= ... ) - Declares a lookbehind expression. Something matching the lookbehind must immediately precede the text in order for it to match. However, the part matched by the lookbehind is not included in the resulting match.
\[ - Matches an open bracket character
[^\]]* - Matches any number of non-close-bracket characters
\| - Matches a pipe character
[^\]|]* - Matches any number of characters which are neither close brackets nor pipe characters.
The lookbehind is greedy, so it will allow for any number of pipes between the open bracket and the matching text.
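A minimal C# sketch of this pattern against the example string from the question. It relies on .NET allowing variable-length lookbehind, which many other regex flavours do not support; the variable names are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Example string from the question; \n is a real newline here.
string html = "<div> \n [dont1.dont2|match1|match2] |dont3 [dont4] dont5. \n </div>";

// The lookbehind requires "[", then no "]" in between, then a "|" directly before the match.
var afterPipe = new Regex(@"(?<=\[[^\]]*\|)[^\]|]*");

string[] values = afterPipe.Matches(html)
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();

Console.WriteLine(string.Join(", ", values));
// Prints: match1, match2
```

dont3 is not matched because a ] sits between the nearest opening bracket and its pipe, so the lookbehind fails there.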
try this:
\[.*?(?:\|(?<mydata>.*?))+\]
Note: the online tool will only show you the last capture of a quantified group for a given match, but .NET remembers every capture of a group that matches multiple times.
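In .NET, each repetition of the group can be read back through Group.Captures. A short sketch using the example string from the question (variable names are hypothetical):

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string html = "<div> \n [dont1.dont2|match1|match2] |dont3 [dont4] dont5. \n </div>";

var pipeGroups = new Regex(@"\[.*?(?:\|(?<mydata>.*?))+\]");
Match bracketMatch = pipeGroups.Match(html);

// Groups["mydata"].Value would only give the last capture;
// Captures keeps every repetition of the quantified group.
string[] captured = bracketMatch.Groups["mydata"].Captures
    .Cast<Capture>()
    .Select(c => c.Value)
    .ToArray();

Console.WriteLine(string.Join(", ", captured));
// Prints: match1, match2
```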
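Try this (the pipes must be escaped, because an unescaped | means alternation; this version only handles exactly two values after the first pipe):
^<div>\s*[^|]+\|([^|]+)\|([^\]|]+)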

Regex to clean repetitions of characters

I have a pattern in the string like this:
T T, and I want to turn it into T.
The character can be any letter from [a-z].
I have tried this Regex Example but was not able to get the replacement to work.
EDIT
For example, A Aa ar r should become Aar: each repeated character collapses to a single copy, and it does not matter whether the first or the second occurrence is kept.
You can use backreferences for this.
/([a-z])\s*\1\s?/gi
Some more explanation:
( begin matching group 1
[a-z] match any character from a to z
) end matching group 1
\s* match any amount of space characters
\1 match the same text that group 1 captured; this is what detects the repetition
\s? match zero or one space character, so that a trailing space is also removed when replacing
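The pattern above is written as a JavaScript-style literal. A rough C# equivalent might look like the sketch below: Regex.Replace already replaces every match (the g flag), RegexOptions.IgnoreCase stands in for the i flag, and using "$1" as the replacement is an assumption about what the answer intends. The helper name Collapse is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Replace each "letter, optional spaces, same letter" pair with a single copy of the letter.
string Collapse(string s) =>
    Regex.Replace(s, @"([a-z])\s*\1\s?", "$1", RegexOptions.IgnoreCase);

Console.WriteLine(Collapse("T T"));        // T
Console.WriteLine(Collapse("A Aa ar r"));  // Aar
```

Note that this also collapses doubled letters inside ordinary words, so it is only suitable when every repetition in the input is unwanted.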

Regex match if a string has length 2 and contains 1 letter and 1 number

Guys, I hate Regex and I suck at writing them.
I have a string that is space separated and contains several codes that I need to pull out. Each code begins with a capital letter and ends with a number, and is only two characters long.
I'm trying to create an array of strings from the initial string and I can't get the regular expression right.
Here is what I have
String[] test = Regex.Split(originalText, "([a-zA-Z0-9]{2})");
I also tried:
String[] test = Regex.Split(originalText, "([A-Z]{1}[0-9]{1})");
I don't have any experience with Regex as I try to avoid writing them whenever possible.
Anyone have any suggestions?
Example input:
AA2410 F7 A4 Y7 B7 A 0715 0836 E0.M80
I need to pull out F7, A4, B7. E0 should be ignored.
You want to collect the results, not split on them, right?
Regex regexObj = new Regex(@"\b[A-Z][0-9]\b");
MatchCollection allMatchResults = regexObj.Matches(subjectString);
should do this. The \bs are word boundaries, making sure that only entire strings (like A1) are extracted, not substrings (like the A1 in TWA101).
If you also need to exclude "words" with non-word characters in them (like E0.M80 in your comment), you need to define your own word boundary, for example:
Regex regexObj = new Regex(@"(?<=^|\s)[A-Z][0-9](?=\s|$)");
Now A1 only matches when surrounded by whitespace (or start/end-of-string positions).
Explanation:
(?<= # Assert that we can match the following before the current position:
^ # Start of string
| # or
\s # whitespace.
)
[A-Z] # Match an uppercase ASCII letter
[0-9] # Match an ASCII digit
(?= # Assert that we can match the following after the current position:
\s # Whitespace
| # or
$ # end of string.
)
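To see the difference between the two boundary definitions, here is a small C# sketch run against the example input from the question. The helper name Codes is hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string subject = "AA2410 F7 A4 Y7 B7 A 0715 0836 E0.M80";

// Collect all match values for a given pattern.
string[] Codes(string pattern) =>
    Regex.Matches(subject, pattern).Cast<Match>().Select(m => m.Value).ToArray();

// \b treats the '.' after E0 as a boundary, so E0 is included.
Console.WriteLine(string.Join(" ", Codes(@"\b[A-Z][0-9]\b")));
// Prints: F7 A4 Y7 B7 E0

// Whitespace-based boundaries reject E0 because it is followed by '.'.
Console.WriteLine(string.Join(" ", Codes(@"(?<=^|\s)[A-Z][0-9](?=\s|$)")));
// Prints: F7 A4 Y7 B7
```

Y7 shows up in both results because it fits the stated rule (capital letter, then digit, surrounded by spaces), even though the question's list of wanted codes does not mention it.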
If you also need to find non-ASCII letters/digits, you can use
\p{Lu}\p{N}
instead of [A-Z][0-9]. This matches any uppercase Unicode letter followed by any Unicode number character (like Ä٣; use \p{Nd} to restrict the second character to decimal digits), but I guess that's not really what you're after, is it?
Do you mean that each code looks like "A00"?
Then this is the regex:
"[A-Z][0-9][0-9]"
Very simple... By the way, there's no point writing {1} in a regex. [0-9]{1} means "match exactly one digit", which is exactly the same as writing [0-9].
Don't give up, simple regexes make perfect sense.
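This should be OK:
String[] all_codes = Regex.Matches(originalText, @"\b[A-Z]\d\b").Cast<Match>().Select(m => m.Value).ToArray();
It gives you an array of all codes that consist of a capital letter followed by a digit and are delimited by some kind of word boundary (whitespace, punctuation, etc.). Note that Regex.Split would return the text between the codes rather than the codes themselves, and that Cast and Select need using System.Linq.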
