Improving my failing regex

Improving my failing regex - c#

My regex was working - until the form of the string it was capturing slightly changed. It used to always be of the form :
Word1 - Word2 - 01.2.3456.7890 - xx-xx - Word 3 [Word-inbracket]
Where I was interested in capturing the xx-xx.
For capturing this data, the following regex worked :
(.+\s*-\s*.+\s*-\s*.+)\s*-\s*(\w{1,3}\s*-\s*\w{1,3})\s*-\s*.+
Selecting groups[2] from it.
Now, however, the string has changed form so that sometimes there is another dash, and another set of letters between 1 and 4 characters after the xx-xx. (Remember, this only happens sometimes).
So, now I also need to capture the info where it is of the form :
Word1 - Word2 - 01.2.3456.7890 - xx-xx-XxxX - Word 3 [Word-inbracket]
Word1 - Word2 - 01.2.3456.7890 - xXX-XxX-xxxx - Word 3 [Word-inbracket]
Etc.
How can I edit my regex to capture this string in addition to the ones that were previously caught? What is the cleanest way to do this ?

A little hacky but that will do the trick:
(.+\s*-\s*.+\s*-\s*.+)\s*-\s*((\w{1,3}\s*-\s*\w{1,3})|(\w{1,4}\s*-\s*\w{1,4}))\s*-\s*.+

Based on the input lines, a simpler approach could be taken altogether.
The following regex matches both cases and should also keep working if the part that changed is modified again.
([^-]*-){3}\s*([^\s]+).*
This matches the "Word1 - Word2 - 01.2.3456.7890 -" prefix with the first (repeated) group, and then captures "xx-xx-XxxX" in the second group.
Also note, I'm going off of the assumption that the second desired group does not contain spaces, as the example strings do not have them.
Explained:
([^-]*-){3} # captures the "word1 - word2 - word3.234.234 -" block
\s*
([^\s]+) # captures the "xx-xx-xxx" block up to the first whitespace char.
.* # matches the rest of the line
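As a rough illustration of the breakdown above, here is a minimal C# sketch that applies the pattern to the sample lines from the question. The class and method names (DashFieldDemo, TryExtractCode) are made up for this example, and it assumes, like the answer, that the wanted field never contains whitespace.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DashFieldDemo
{
    // The pattern from the answer above, unchanged.
    private static readonly Regex Pattern =
        new Regex(@"([^-]*-){3}\s*([^\s]+).*");

    // Hypothetical helper: puts the fourth dash-separated field in 'code'.
    public static bool TryExtractCode(string line, out string code)
    {
        Match m = Pattern.Match(line);
        code = m.Success ? m.Groups[2].Value : string.Empty;
        return m.Success;
    }

    public static void Main()
    {
        string[] lines =
        {
            "Word1 - Word2 - 01.2.3456.7890 - xx-xx - Word 3 [Word-inbracket]",
            "Word1 - Word2 - 01.2.3456.7890 - xx-xx-XxxX - Word 3 [Word-inbracket]",
            "Word1 - Word2 - 01.2.3456.7890 - xXX-XxX-xxxx - Word 3 [Word-inbracket]"
        };

        foreach (string line in lines)
        {
            if (TryExtractCode(line, out string code))
                Console.WriteLine(code);
        }
    }
}
```

One detail worth knowing: because group 1 is repeated three times, Groups[1].Value only holds the last repetition; in .NET the earlier repetitions remain available through Groups[1].Captures.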

I believe this should do it:
(.+?\s*-\s*.+?\s*-\s*.+?)\s*-\s*(\w{1,3}\s*-\s*\w{1,3})\s*(?:-(\w{1,4}))?\s*-\s*.+
The changes I've made are:
Made the any-character matches at the beginning non-greedy by adding '?' after them; this stops them gobbling up too much when the extra bit is present.
Added '(?:-(\w{1,4}))?', which matches the optional extra bit (1 to 4 characters, as described in the question) if it is present, but without capturing the '-' prefix (the '?:' makes the outer group non-capturing).
This gives you an extra capturing group (group 3) that holds the optional bit.
You can see it in action here (edited).
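To make the optional-group idea concrete, here is a small C# sketch. It assumes the optional part may be up to four characters long, as the question describes, so the quantifier on that group is {1,4}. The names OptionalSuffixDemo and ExtractParts are invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OptionalSuffixDemo
{
    // Lazy prefix, the mandatory "xx-xx" part (group 2),
    // and an optional "-XxxX" part (group 3, assumed to be 1 to 4 characters).
    private static readonly Regex Pattern = new Regex(
        @"(.+?\s*-\s*.+?\s*-\s*.+?)\s*-\s*(\w{1,3}\s*-\s*\w{1,3})\s*(?:-(\w{1,4}))?\s*-\s*.+");

    // Hypothetical helper: returns { main part, optional part }.
    // The optional part is an empty string when the group did not take part in the match.
    public static string[] ExtractParts(string line)
    {
        Match m = Pattern.Match(line);
        if (!m.Success)
            return new string[0];
        return new[] { m.Groups[2].Value, m.Groups[3].Value };
    }

    public static void Main()
    {
        string[] lines =
        {
            "Word1 - Word2 - 01.2.3456.7890 - xx-xx - Word 3 [Word-inbracket]",
            "Word1 - Word2 - 01.2.3456.7890 - xx-xx-XxxX - Word 3 [Word-inbracket]"
        };

        foreach (string line in lines)
        {
            string[] parts = ExtractParts(line);
            Console.WriteLine("main: '" + parts[0] + "', optional: '" + parts[1] + "'");
        }
    }
}
```

If you need to tell "absent" apart from "empty", checking m.Groups[3].Success is more explicit than comparing the value with an empty string.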

This is clearer: .+\s-\s(.+)\s-\s.+$

Related

Regex which matches URN by rfc8141

I am struggling to find a Regex which could match a URN as described in rfc8141.
I have tried this one:
\A(?i:urn:(?!urn:)(?<nid>[a-z0-9][a-z0-9-]{1,31}):(?<nss>(?:[a-z0-9()+,-.:=#;$_!*']|%[0-9a-f]{2})+))\z
but this one only matches the first part of the URN without the components.
For example, for the URN urn:example:a123,0%7C00~&z456/789?+abc?=xyz#12/3 we should match the following groups:
NID - example
NSS - a123,0%7C00~&z456/789 (from the last ':' till we match '?+' or '?=' or '#')
r-component - abc (from '?+' till '?=' or '#')
f-component - 12/3 (from '#' till end)

I haven't read all the specifications, so there may be other rules to implement, but it should put you on the way for the optional components:
\A(?i:urn:(?!urn:)(?<nid>[a-z0-9][a-z0-9-]{1,31}):(?<nss>(?:[-a-z0-9()+,.:=#;$_!*'&~\/]|%[0-9a-f]{2})+)(?:\?\+(?<rcomponent>.*?))?(?:\?=(?<qcomponent>.*?))?(?:#(?<fcomponent>.*?))?)\z
explanations:
(?<nss>(?:[-a-z0-9()+,.:=#;$_!*'&~\/]|%[0-9a-f]{2})+) : The - has been moved to the beginning of the list to be considered in the allowed chars, or else it means "range from , to .". The characters &, ~ and / (has to be escaped with "\") have also been added to the list, or else it won't match your example.
optional components: (?:\?\+(?<rcomponent>.*?))? : inside an optional non-capturing group (?:)? to prevent capturing the identifier (the ?+, ?= and # part). The chars ? and + have to be escaped with "\". Will capture anything (.) but in lazy mode (*?) or else the first component found would capture everything until the end of the string.
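A minimal C# sketch of how the named groups could be read back, using the pattern from this answer as-is. It is not a complete RFC 8141 validator, and the names UrnDemo and ParseUrn are made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrnDemo
{
    // The pattern from the answer above, copied unchanged into a verbatim string.
    private static readonly Regex UrnPattern = new Regex(
        @"\A(?i:urn:(?!urn:)(?<nid>[a-z0-9][a-z0-9-]{1,31}):(?<nss>(?:[-a-z0-9()+,.:=#;$_!*'&~\/]|%[0-9a-f]{2})+)(?:\?\+(?<rcomponent>.*?))?(?:\?=(?<qcomponent>.*?))?(?:#(?<fcomponent>.*?))?)\z");

    // Hypothetical helper: returns the Match so callers can read the named groups.
    public static Match ParseUrn(string urn)
    {
        return UrnPattern.Match(urn);
    }

    public static void Main()
    {
        Match m = ParseUrn("urn:example:a123,0%7C00~&z456/789?+abc?=xyz#12/3");
        if (!m.Success)
        {
            Console.WriteLine("not a URN according to this pattern");
            return;
        }

        // Optional components that are absent yield an empty string here.
        foreach (string name in new[] { "nid", "nss", "rcomponent", "qcomponent", "fcomponent" })
            Console.WriteLine(name + " = " + m.Groups[name].Value);
    }
}
```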
See working example in Regex101
Hope that helps

If you want to validate strings as Uniform Resource Names (URNs) according to RFC 8141, you can refer to URN8141Test.java and URN8141.java.
They have been used in our team for a few years.

Regex c#: Last Name in an email to include only one hyphen

Friends,
I have a textbox that takes firstname.Lastname for my organisation.
The last name may or may not include a hyphen. If it does, the hyphen should appear:
1) Only once in the last name
2) Not at the beginning of the last name
3) Not at the end of the last name
I have figured out this regex:
^(?!.{51})[a-zA-Z]+(?:[.][a-zA-Z-]+)?$
It allows "-" in the last name, but it does not enforce the conditions above.
I'm still learning regex, and it is taking me time to figure this out.
Please help
-Thank you

You need to add one more nested group inside the last name part:
^(?!.{51})[a-zA-Z]+(?:\.[a-zA-Z]+(?:-[a-zA-Z]+)?)?$
                                 ^^^^^^^^^^^^^^^
See the regex demo
Details:
^ - start of string
(?!.{51}) - no more than 50 chars in the string requirement
[a-zA-Z]+ - 1+ ASCII letters
(?:\.[a-zA-Z]+(?:-[a-zA-Z]+)?)? - an optional sequence of:
\. - a dot
[a-zA-Z]+ - 1 or more ASCII letters
(?:-[a-zA-Z]+)? - an optional sequence of:
- - a hyphen
[a-zA-Z]+ - 1+ ASCII letters
$ - end of string.
To declare this pattern, use a verbatim string literal:
var pattern = @"^(?!.{51})[a-zA-Z]+(?:\.[a-zA-Z]+(?:-[a-zA-Z]+)?)?$";
To match any Unicode letters, use \p{L} instead of [a-zA-Z].
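A short C# sketch that exercises this pattern against the three hyphen rules from the question. The names UserNameDemo and IsValidUserName, and the sample inputs, are invented for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UserNameDemo
{
    // The pattern from the answer above: an optional ".lastname" part,
    // which may itself contain one hyphen between two runs of letters.
    private static readonly Regex Pattern = new Regex(
        @"^(?!.{51})[a-zA-Z]+(?:\.[a-zA-Z]+(?:-[a-zA-Z]+)?)?$");

    // Hypothetical helper: true when the input satisfies the pattern.
    public static bool IsValidUserName(string input)
    {
        return Pattern.IsMatch(input);
    }

    public static void Main()
    {
        string[] samples =
        {
            "john.smith",        // no hyphen
            "john.smith-jones",  // one hyphen in the middle
            "john.-smith",       // hyphen at the beginning of the last name
            "john.smith-",       // hyphen at the end of the last name
            "john.a-b-c"         // more than one hyphen
        };

        foreach (string s in samples)
            Console.WriteLine(s + " -> " + IsValidUserName(s));
    }
}
```

With these inputs only the first two are accepted; the other three each break one of the hyphen rules.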

This is a somewhat quick and dirty solution.
^(?!.{51})[a-zA-Z]+\.[a-zA-Z]{1,}-?[a-zA-Z]+$
The first part of the pattern is taken from yours; it seems to match just fine.
My addition \.[a-zA-Z]{1,}-?[a-zA-Z]+ describes what comes next.
\. matches the literal dot between the first name and the last name.
[a-zA-Z]{1,} makes sure that there is at least one letter after the dot, so a "-" cannot come first.
-? says that a single "-" may appear after those first letters.
[a-zA-Z]+ says that any number of letters (at least one) may follow, but no further "-".
Note that with this pattern the last name is mandatory and needs at least two letters.

C# Regular expression to match on a character not following pairs of the same character

Objective: Regex Matching
For this example I'm interested in matching a "|" pipe character.
I need to match it if it's alone: "aaa|aaa"
I need to match it (the last pipe) only if it's preceded by pairs of pipe: (2,4,6,8...any even number)
Another way: I want to ignore ALL pipe pairs "||" (right to left)
or I want to select bachelor bars only (the odd man out)
string twomatches = "aaaaaaaaa|||||aaaaaa|||aaaaaa"; // the last pipe of each run should match
string onematch = "aaaaaaaaa|||aaaaaaa||aaaaaaaa"; // only the last pipe of the first run should match
string noMatch = "||";
string noMatch2 = "||||";
I'm trying to select the last "|" only when preceded by an even sequence of "|" pairs or in a string when a single bar exists by itself.
Regardless of the number of "|"

You may use the following regex to select just odd one pipe out:
(?<=(?<!\|)(?:\|{2})*)\|(?!\|)
See regex demo.
The regex breakdown:
(?<=(?<!\|)(?:\|{2})*) - if a pipe is preceded with an even number of pipes ((?:\|{2})* - 0 or more sequences of exactly 2 pipes) from a position that has no preceding pipe ((?<!\|))
\| - match an odd pipe on the right
(?!\|) - if it is not followed by another pipe.
Please note that this regex uses a variable-width look-behind and is very resource-consuming. I'd rather use a capturing group mechanism here, but it all depends on the actual purpose of matching that odd pipe.
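As a quick check of the breakdown above, this C# sketch lists the index of every "odd one out" pipe found by the lookbehind version. The names OddPipeDemo and OddPipeIndexes, and the short sample strings, are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class OddPipeDemo
{
    // The variable-width lookbehind pattern from the answer above.
    private static readonly Regex OddPipe =
        new Regex(@"(?<=(?<!\|)(?:\|{2})*)\|(?!\|)");

    // Hypothetical helper: positions of the pipes left over after pairing.
    public static List<int> OddPipeIndexes(string input)
    {
        var indexes = new List<int>();
        foreach (Match m in OddPipe.Matches(input))
            indexes.Add(m.Index);
        return indexes;
    }

    public static void Main()
    {
        string[] samples = { "aaa|aaa", "aa|||||bb|||cc", "||", "||||" };
        foreach (string s in samples)
            Console.WriteLine(s + " -> [" + string.Join(", ", OddPipeIndexes(s)) + "]");
    }
}
```

For "aaa|aaa" the single pipe at index 3 is reported, for "aa|||||bb|||cc" the last pipe of each run (indexes 6 and 11), and the two even-length runs produce no matches.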
Here is a modified version of the regex for removing the odd one out:
var s = "1|2||3|||4||||5|||||6||||||7|||||||";
var data = Regex.Replace(s, @"(?<!\|)(?<even_pipes>(?:\|{2})*)\|(?!\|)", "${even_pipes}");
Console.WriteLine(data);
See IDEONE demo. Here, the quantified part is moved from lookbehind to an even_pipes named capturing group, so that it could be restored with the backreference in the replaced string. Regexhero.net shows 129,046 iterations per second for the version with a capturing group and 69,206 with the original version with variable-width lookbehind.
Only use variable-width look-behind if it is absolutely necessary!

Oh, it's reopened! If you need better performance, also try this improved version with a negated lookbehind.
\|(?!\|)(?<!(?:[^|]|^)(?:\|\|)*)
The idea here is to first match the last literal | at the right end of a sequence (or a single |) and to run a negated version of the lookbehind just after the match. This should perform considerably better.
\|(?!\|) matches a literal | only if it is not followed by another pipe character (the rightmost one in a sequence).
(?<!(?:[^|]|^)(?:\|\|)*) succeeds only if the position right after the matched | is not preceded by (?:\|\|)*, any number of literal ||, reaching back to a non-| character or the ^ start. In other words: the position must not be preceded by an even number of pipe characters.
Btw, there is no performance gain in using \|{2} over \|\|, though it might be more readable.
See demo at regexstorm

C# Regular Expression: Search the first 3 letters of each name

Does anyone know how I can write a regex (C#) that searches for the first 3 letters of each part of a full name?
Without the use of (.*)
I used (.*), but it runs through the text far beyond the requested name, or
if it finds the first condition and then finds the second condition 100 words later, it returns text that is not what I am looking for, so I have to limit the number of words.
Example: \s*(?:\s+\S+){0,2}\s*
I would like to ignore name parts with fewer than 3 characters if they exist in the name.
Search for any name whose parts start with these 3-character prefixes:
'Mar Jac Rey' (regex that performs search)
Should match:
Marck Jacobs L. S. Reynolds
Marcus Jacobine Reys
Maroon Jacqueline by Reyils
Can anyone help me?

The zero or more quantifier (*) is 'greedy' by default; that is, it will consume as many characters as possible while still finding the remainder of the pattern. This is why Mar.*Jac will match the first Mar in the input and the last Jac and everything in between.
One potential solution is just to make your pattern 'non-greedy' (*?). This will make it consume as few characters as possible in order to match the remainder of the pattern.
Mar.*?Jac.*?Rey
However, this is not a great solution because it would still match the various name parts regardless of what other text appears in between—e.g. Marcus Jacobine Should Not Match Reys would be a valid match.
To allow only whitespace or at most 2 consecutive non-whitespace characters to appear between each name part, you'd have to get more fancy:
\bMar\w*(\s+\S{0,2})*\s+Jac\w*(\s+\S{0,2})*\s+Rey\w*
The pattern (\s+\S{0,2})*\s+ will match any number of non-whitespace tokens of at most two characters each, with whitespace around them. The \w* after each name part ensures that the entire name is included in that part of the match (you might want to use \S* instead here, but that's not entirely clear from your question). And I threw in a word boundary (\b) at the beginning to ensure that the match does not start in the middle of a 'word' (e.g. OMar would not match).
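Here is a small C# sketch of that last pattern run against the sample names from the question, plus the counter-example mentioned above. The names NamePrefixDemo and MatchesNames are made up for this example, and matching is case-sensitive as written.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamePrefixDemo
{
    // Name parts starting with Mar, Jac and Rey, separated only by whitespace
    // and by tokens of at most two non-whitespace characters (e.g. "L." or "by").
    private static readonly Regex Pattern = new Regex(
        @"\bMar\w*(\s+\S{0,2})*\s+Jac\w*(\s+\S{0,2})*\s+Rey\w*");

    // Hypothetical helper: true when the text contains such a name.
    public static bool MatchesNames(string text)
    {
        return Pattern.IsMatch(text);
    }

    public static void Main()
    {
        string[] samples =
        {
            "Marck Jacobs L. S. Reynolds",
            "Marcus Jacobine Reys",
            "Maroon Jacqueline by Reyils",
            "Marcus Jacobine Should Not Match Reys"
        };

        foreach (string s in samples)
            Console.WriteLine(s + " -> " + MatchesNames(s));
    }
}
```

The first three samples match; the last one does not, because "Should", "Not" and "Match" are longer than two characters.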

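I think what you want is a regular expression that checks whether the string starts with one of the prefixes, case-insensitively. Note that square brackets would define a character class (any three of the listed characters), so use a group with alternation and pass RegexOptions.IgnoreCase:
@"^(?:Mar|Jac|Rey)"
Less specific:
@"^[\w]{3}"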

If you want to capture the first three letters of every word that has more than three characters, you could use something like:
((?<name>[\w]{3})\w+)+
And enable ExplicitCapture when initializing your Regex.
It will return a series of Match objects, each with a group named "name" holding one result.
Code sample:
Regex regex = new Regex(@"((?<name>[\w]{3})\w+)+", RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase);
var matches = regex.Matches("Marck Jacobs L. S. Reynolds");
If you also want to capture words of exactly 3 characters, you can replace the last "\w" with a space. In that case, remember to handle the last word of the phrase.

Regex to find the href from anchore tag for specific cases in c# .net

I have some specific cases for which regex does not fit.
Examples
1. bhjbhj-----containing quote(') in between " " and having space in between
2. gffgd
3. bhjbhj----without quotes having space
I have used the regex from here
My regex is as below
<a\s+[^>]*\s*href\s*=('|"|)\s*((?:[^\1|>]|[\n\r])+)(\1)[ |>][^>]*?>(.*?)</a>
So 1 and 2 work fine, but for 3 it gives
abyuyyuub/m'hhjhh/js jmm but it should be
abyuyyuub/m'hhjhh/js
Also I want to know how to match the first occurrence i.e. how to match the first double quote("),single quote(') or space

If you insist on having that particular regex satisfying all three examples - you should add an \s to the exclusion in the non-capture group inside the second capture group: [^\1|>] to [^\1\s|>], resulting in:
<a\s+[^>]*\s*href\s*=('|"|)\s*((?:[^\1\s|>]|[\n\r])+)(\1)[ |>][^>]*?>(.*?)</a>
Overall it is of course a "bad idea" (tm) to parse urls in this way, as previously mentioned in the comments.
