I have to extract the string between a digit pattern and either a colon or a newline (first occurrence).
my string would look like:
05-30-1306-29-13 BUILDERS RISK:
LIMITS/DEDUCTIBLES:
I would like to extract BUILDERS RISK. There may or may not be a colon; if there is none, the newline is treated as the terminating pattern.
Here's what I have come up with so far
\d{2}-\d{2}-\d{4}-\d{2}-\d{2}\s*\W+[^:|\n]+:\s*
The numerical pattern will always be 2-2-4-2-2 digits, followed by any string, followed by either \n or :
The regex so far gets what I need but I don't know how to break it into different matches so I can take the second match
1st match - digit pattern
2nd match - what i need
3rd match - colon or newline
Any pointers will be helpful.
UPDATE: A couple of other forms the text to be searched could take:
11-06-1212-29-12 DWELLING FIRE (DP-3): ANNUAL RENTAL
11-05-1212-26-12 HOMEOWNERS (HO-3): SECONDARY HOME
I would only want anything before colon or if that is not there, take string till newline is found. As a side note, the text of significance may not be present in same line and appear in next line but will always be followed by either a colon or newline in the same line.
PS: Extracted text should not contain colon
It appears you may use
\b(\d{2}-\d{2}-\d{4}-\d{2}-\d{2})\W+(.*?)(:?\r?\n\s*)
See the regex demo.
Details
\b - a word boundary (change to (?<!\d) if the digits can be glued to a letter or underscore)
(\d{2}-\d{2}-\d{4}-\d{2}-\d{2}) - Group 1: two digits, -, two digits, -, four digits, -, two digits, -, two digits
\W+ - 1+ non-word chars (to stay on the line, replace with [^\w\r\n]+)
(.*?) - Group 2: any zero or more chars other than newline, as few as possible
(:?\r?\n\s*) - Group 3: an optional :, an optional CR, an LF symbol and then any 0+ whitespace chars.
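To make the three groups concrete, here is a minimal Python sketch of the pattern above run against the sample from the question. The second pattern, VARIANT_RE, is my own assumption and not part of the answer: it ends group 2 at the first colon or line break, which seems closer to the lines given in the UPDATE, where more text follows the colon.

```python
import re

# Pattern from the answer: group 1 = digits, group 2 = text, group 3 = terminator.
ANSWER_RE = re.compile(r"\b(\d{2}-\d{2}-\d{4}-\d{2}-\d{2})\W+(.*?)(:?\r?\n\s*)")

# Hypothetical variant (an assumption, not from the answer): group 2 stops at
# the first colon or line break, so text after a mid-line colon is left out.
VARIANT_RE = re.compile(r"\b(\d{2}-\d{2}-\d{4}-\d{2}-\d{2})\W+([^:\r\n]*)")

sample = "05-30-1306-29-13 BUILDERS RISK:\nLIMITS/DEDUCTIBLES:\n"
digits, text, terminator = ANSWER_RE.search(sample).groups()

update_line = "11-06-1212-29-12 DWELLING FIRE (DP-3): ANNUAL RENTAL\n"
answer_text = ANSWER_RE.search(update_line).group(2)
variant_text = VARIANT_RE.search(update_line).group(2)

print(repr(digits), repr(text), repr(terminator))
print(repr(answer_text), repr(variant_text))
```

The answer's pattern only treats a colon as part of the terminator when it sits right before the line break, so on the UPDATE line its group 2 runs to the end of the line. The variant is one possible way around that, under the assumption that the wanted text never contains a colon itself.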
I am trying to build regex to match - Test get:all words:test
can start with a word then space and followed by any occurrence of word:word separated by space.
@"^[a-zA-Z]+/s(^[a-zA-Z]+:^[a-zA-Z]+/s)*"
You added extra start of string anchors, ^, inside the pattern, and you need to remove them for sure.
Besides, the whitespace patterns must be written as \s and the first \s must be moved inside the repeated group that should be converted into a non-capturing one ((?:...)) for better performance.
You can use
^[a-zA-Z]+(?:\s+[a-zA-Z]+:[a-zA-Z]+)*$
See the regex demo. Details:
^ - start of string
[a-zA-Z]+ - one or more ASCII letters
(?:\s+[a-zA-Z]+:[a-zA-Z]+)* - zero or more repetitions of
\s+ - one or more whitespaces
[a-zA-Z]+:[a-zA-Z]+ - one or more ASCII letters, :, one or more ASCII letters
$ - end of string (or use \z to match the very end of string).
If you meant to allow any word chars (letters, digits, connector punctuation) then replace each [a-zA-Z] with \w.
If you need to support just any Unicode letters, replace each [a-zA-Z] with \p{L}.
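A quick way to check the corrected pattern is to run it over a few strings. This is a minimal Python sketch rather than the .NET code from the question; the is_valid helper and the test strings other than the one in the question are my own assumptions.

```python
import re

# The corrected pattern from the answer, unchanged.
WORD_PAIRS_RE = re.compile(r"^[a-zA-Z]+(?:\s+[a-zA-Z]+:[a-zA-Z]+)*$")

def is_valid(text: str) -> bool:
    """Return True if text is a word followed by zero or more word:word pairs."""
    return WORD_PAIRS_RE.search(text) is not None

for candidate in ["Test get:all words:test", "Test", "Test get:", "get:all"]:
    print(repr(candidate), is_valid(candidate))
```

In Python, $ also matches just before a trailing newline, so \Z would play the role that \z plays in .NET if that matters for your input.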
I recently was assigned an impossible task (in my estimation) to create a regex pattern in which I should be able to validate several words in the same sentence or textbox with the following guidelines:
Each name/word has to have first letter upper case
Names/Words separated by spaces
Each name/word 3 characters long or more
And the sentence or textbox text can't be longer than 20 characters
Example: Joseph Gordon Levitt
This example is exactly 20 characters long, each name (or word) is at least 3 characters long, the names are separated by spaces, and the first letter of each name (or word) is upper case.
I tried this regex pattern ^[A-Z]{1}[a-zA-Z\s]{3,20}$. It works for some strings, but not all.
One option is this:
^(?!.{21})[A-Z][a-z]{2,}(\s[A-Z][a-z]{2,})*$
Demo: https://dotnetfiddle.net/oWjSI4
Let's walk through the requirements:
Each name/word has to have first letter upper case: Use \p{Lu}
Names/Words separated by spaces: Use \s+ (1 or more spaces) / \s (only single space)
Each name/word 3 characters long or more: Word pattern will thus be \p{Lu}\p{L}{2,} - starting with an uppercase and then having 2 or more letters
And the sentence or textbox text can't be longer than 20 characters: Use a lookahead right after ^ / \A (start of string): either a negative one, (?!.{21}), or a positive one, (?=.{0,20}$).
The resulting regex will look like
^(?!.{21})\p{Lu}\p{L}{2,}(?:\s\p{Lu}\p{L}{2,})*$
^(?=.{0,20}$)\p{Lu}\p{L}{2,}(?:\s\p{Lu}\p{L}{2,})*$
Or, if there can be 1+ whitespaces between words
^(?!.{21})\p{Lu}\p{L}{2,}(?:\s+\p{Lu}\p{L}{2,})*$
^(?=.{0,20}$)\p{Lu}\p{L}{2,}(?:\s+\p{Lu}\p{L}{2,})*$
NOTE: If you ever test it against a string that can end with a \n, newline char, replace $ with \z.
See the regex demo.
Details
^ - start of string
(?=.{0,20}$) - there must be 0 to 20 non-newline chars in the string till the end
\p{Lu} - an uppercase letter
\p{L}{2,} - two or more letters
(?:\s\p{Lu}\p{L}{2,})* - 0 or more repetitions of:
\s - a whitespace (or 1+ whitespaces if \s+ is used)
\p{Lu}\p{L}{2,} - an uppercase letter and then any two or more letters
$ - end of string (\z is the very end of the string).
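The length lookahead can be tried out with a short Python sketch. Python's built-in re module has no \p{Lu} / \p{L}, so this uses the ASCII [A-Z] / [a-z] form from the first answer, with \Z standing in for .NET's \z; the is_valid_name helper and the extra test strings are my own assumptions.

```python
import re

# ASCII form of the pattern: at most 20 chars, each word capitalised and 3+ letters long.
NAME_RE = re.compile(r"^(?!.{21})[A-Z][a-z]{2,}(?:\s[A-Z][a-z]{2,})*\Z")

def is_valid_name(text: str) -> bool:
    """Return True if text satisfies all four requirements from the question."""
    return NAME_RE.search(text) is not None

for candidate in ["Joseph Gordon Levitt", "Joseph Gordon Levitts", "Jo Gordon", "joseph Gordon"]:
    print(repr(candidate), len(candidate), is_valid_name(candidate))
```

The negative lookahead fails as soon as a 21st character exists, so the rest of the pattern never has to count characters itself.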
I have already gone through many posts on SO, but I didn't find what I need for my specific scenario.
I need a regex for an alphanumeric string,
where the following conditions should be met:
Valid string:
ameya123 (alphabets and numbers)
ameya (only alphabets)
AMeya12 (capital and lowercase letters and numbers)
Ameya_123 (letters, underscore and numbers)
Ameya_ 123 (letters, underscore and white spaces)
Invalid string:
123 (only numbers)
_ (only underscore)
(only white spaces)
any special character other than underscore
What I have tried so far:
(?=.*[a-zA-Z])(?=.*[0-9]*[\s]*[_]*)
The above regex works in an online regex editor, but it does not work in a data annotation in C#.
Please suggest.
Based on your requirements and not your attempt, what you are in need of is this:
^(?!(?:\d+|_+| +)$)[\w ]+$
The negative lookahead looks for undesired matches to fail the whole process. Those are strings containing digits only, underscores only or spaces only. If they never happen we want to have a match for ^[\w ]+$ which is nearly the same as ^[a-zA-Z0-9_ ]+$.
See live demo here
Explanation:
^ Start of line / string
(?! Start of negative lookahead
(?: Start of non-capturing group
\d+ Match digits
| Or
_+ Match underscores
| Or
[ ]+ Match spaces
)$ End of non-capturing group immediately followed by end of line / string (none of previous matches should be found)
) End of negative lookahead
[\w ]+$ Match a character inside the character set up to end of input string
Note: \w is a shorthand for [a-zA-Z0-9_] unless u modifier is set.
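How much \w accepts depends on the engine and its flags. As a rough illustration in Python (not the PCRE or .NET engines discussed here), \w is Unicode-aware by default for str patterns and shrinks to [a-zA-Z0-9_] under re.ASCII. The test strings are my own assumptions.

```python
import re

# The pattern from the answer, compiled twice with different \w semantics.
PATTERN = r"^(?!(?:\d+|_+| +)$)[\w ]+$"
unicode_re = re.compile(PATTERN)            # \w matches Unicode word chars
ascii_re = re.compile(PATTERN, re.ASCII)    # \w is [a-zA-Z0-9_] only

accented = "Am\u00e9ya_123"  # contains a non-ASCII letter (e with acute accent)
print(unicode_re.search(accented) is not None)  # accepted in Unicode mode
print(ascii_re.search(accented) is not None)    # rejected in ASCII mode

# Inputs made only of digits, underscores or spaces are rejected either way.
for bad in ["123", "_", "   "]:
    print(repr(bad), unicode_re.search(bad) is None, ascii_re.search(bad) is None)
```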
One problem with your regex is that in annotations, the regex must match and consume the entire string input, while your pattern only contains lookarounds that do not consume any text.
You may use
^(?!\d+$)(?![_\s]+$)[A-Za-z0-9\s_]+$
See the regex demo. Note that \w, when parsed with the .NET regex engine for server-side validation, also allows any Unicode letters, digits and some more stuff, so I'd rather stick to [A-Za-z0-9_] to be consistent with both server- and client-side validation.
Details
^ - start of string (not necessary here, but good to have when debugging)
(?!\d+$) - a negative lookahead that fails the match if the whole string consists of digits
(?![_\s]+$) - a negative lookahead that fails the match if the whole string consists of underscores and/or whitespaces. NOTE: if you plan to only disallow ____ or " " like inputs, you need to split this lookahead into (?!_+$) and (?!\s+$).
[A-Za-z0-9\s_]+ - 1+ ASCII letters, digits, _ and whitespace chars
$ - end of string (not necessary here, but still good to have).
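The valid and invalid strings from the question can be run through this pattern directly. Below is a minimal Python sketch, not the C# data annotation itself; the is_valid helper and the "ameya#1" sample (standing in for "any special character") are my own assumptions.

```python
import re

# The pattern from the answer: two negative lookaheads, then the consuming part.
VALID_RE = re.compile(r"^(?!\d+$)(?![_\s]+$)[A-Za-z0-9\s_]+$")

def is_valid(value: str) -> bool:
    """Return True if value passes both lookaheads and uses only allowed chars."""
    return VALID_RE.search(value) is not None

should_pass = ["ameya123", "ameya", "AMeya12", "Ameya_123", "Ameya_ 123"]
should_fail = ["123", "_", "   ", "ameya#1"]

for value in should_pass + should_fail:
    print(repr(value), is_valid(value))
```

Unlike the lookahead-only attempt in the question, [A-Za-z0-9\s_]+ actually consumes the input, which is what a validator that requires a full-string match needs.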
If I understand your requirements correctly, you need to match one or more letters (uppercase or lowercase), and possibly zero or more of digits, whitespace, or underscore. This implies the following pattern:
^[A-Za-z0-9\s_]*[A-Za-z][A-Za-z0-9\s_]*$
Demo
In the demo, I have replaced \s with \t \r, because \s was matching across all lines.
Unlike the answers given by @revo and @wiktor, I don't have a fancy looking explanation for the regex. I am beautiful even without my makeup on. Honestly, if you don't understand the pattern I gave, you might want to review a good regex tutorial.
This simple RegEx should do it:
[a-zA-Z]+[0-9_ ]*
One or more letters, followed by zero or more digits, underscores and spaces.
This one should be good:
[\w\s_]*[a-zA-Z]+[\w\s_]*
Given 2 different lines I'm parsing, I need to extract the data points into regex match groups.
Example Line 1:
Header values are as follows:
DATE{space}TYPE{space}DESCR{space}VOLUME{space}RATE{space}TOTAL
[11/30/15] [CF] [DISC 1] [28270.18] [0.00150] [-42.41]
Example Line 2:
DATE{space}TYPE{space}DESCR{space}VOLUME{space}RATE{space}TOTAL
[11/30/15] [CF] [OTHER VOLUME FEES] [28186.68] [0.00008] [-2.25]
I'm using the following regex to get matches:
(?<date>^\d{1,2}[-/.]\d{1,2}[-/.]\d{1,2}[\d+])\s+(?<type>[A-Za-z]{2})\s+(?<descr>\w+\s+.*?(1))\s+.*?(?<volume>(\d+(?:\.\d+?))\s+.*?(?<rate>([0]?(\d+(?:\.\d+)?)))\s+(?<total>[-+]?\d+[.,]\d+)?.*$")
I can match the first case, but never the second. There will always be a total, but there may NOT always be a volume or rate. In addition, volume can be a whole number, a decimal or a code (e.g. "1B").
What am I missing here?
The description field is an open field and may contain "1" in it. I can have several words in it, or just 1.
Your log lines contain 6 fields, but the 4th and 5th can go missing. A common way to match optional fields is using an optional non-capturing group, (?:...)?. These groups do not create a separate memory buffer for the text they match, which is why they help keep matching cleaner and more efficient.
NOTE that in .NET, there is a way to make all non-named capturing groups non-capturing by use of RegexOptions.ExplicitCapture option.
Your fixed regex may look like
^(?<date>\d{1,2}[-/.]\d{1,2}[-/.]\d{1,2})\s+(?:(?<type>[A-Z]{2})\s+)?(?:(?<descr>\w.*?)\s+)?(?:(?<volume>\d*\.?\d+)\s+)?(?:(?<rate>\d*\.?\d+)\s+)?(?<total>[-+]?\d*[.,]?\d+)\s*$
See the .NET regex demo.
Details
^ - start of a line (when RegexOptions.Multiline is used)
(?<date>\d{1,2}[-/.]\d{1,2}[-/.]\d{1,2}) - Group "date": 1-2 digits and then 2 repetitions of -, / or . followed with 1-2 digits (thus, this pattern can be written as (?<date>\d{1,2}(?:[-/.]\d{1,2}){2})).
\s+ - 1 or more whitespaces
(?:(?<type>[A-Z]{2})\s+)? - an optional group matching 2 uppercase ASCII letters, captured into Group "type", and then 1+ whitespaces
(?:(?<descr>\w.*?)\s+)? - an optional group matching a word char (letter, digit or _ and some other special chars, like diacritics) followed with any 0+ chars other than a newline char LF, as few as possible, all this captured into Group "descr", and then 1+ whitespaces
(?:(?<volume>\d*\.?\d+)\s+)? - an optional group matching 0+ digits, an optional . and then 1+ digits (that is, floats or integers) captured into Group "volume", then 1+ whitespace chars
(?:(?<rate>\d*\.?\d+)\s+)? - an optional group matching a float or integer values captured into Group "rate", and then 1+ whitespace chars
(?<total>[-+]?\d*[.,]?\d+) - Group "total": an optional - or + followed with 0+ digits, an optional . or , and then 1+ digits (so, positive or negative floats or integers are matched)
\s* - any 0+ trailing whitespaces
$ - end of the line.
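To see how the optional groups behave, here is a minimal Python sketch of the same pattern. Python spells named groups (?P<name>...) instead of (?<name>...), and I am assuming the square brackets in the question's sample lines only mark field boundaries and are not part of the data. The parse_line helper and the third test line (no volume or rate) are hypothetical.

```python
import re

# Same pattern as in the answer, split per field, with Python's (?P<name>...) syntax.
LINE_RE = re.compile(
    r"^(?P<date>\d{1,2}[-/.]\d{1,2}[-/.]\d{1,2})\s+"
    r"(?:(?P<type>[A-Z]{2})\s+)?"       # optional 2-letter type
    r"(?:(?P<descr>\w.*?)\s+)?"         # optional free-text description (lazy)
    r"(?:(?P<volume>\d*\.?\d+)\s+)?"    # optional volume
    r"(?:(?P<rate>\d*\.?\d+)\s+)?"      # optional rate
    r"(?P<total>[-+]?\d*[.,]?\d+)\s*$",
    re.MULTILINE,
)

def parse_line(line: str):
    """Return a dict of named groups, or None if the line does not match."""
    match = LINE_RE.search(line)
    return match.groupdict() if match else None

first = parse_line("11/30/15 CF DISC 1 28270.18 0.00150 -42.41")
second = parse_line("11/30/15 CF OTHER VOLUME FEES 28186.68 0.00008 -2.25")
sparse = parse_line("11/30/15 CF MONTHLY FEE -5.00")  # hypothetical line
print(first, second, sparse, sep="\n")
```

Because descr is lazy and the numeric groups are optional, the engine backtracks until the last three tokens line up as volume, rate and total, which is why a "1" inside the description ends up in descr. Groups that did not take part in the match come back as None.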
(?<date>^\d{1,2}[-/.]\d{1,2}[-/.]\d{1,2}[\d+])\s+(?<type>[A-Z]{2})\s+(?<descr>\w+.*?\s+)(?<volume>\d+[.]?\d+)\s+(?<rate>\d+[.]?\d+)\s+(?<total>[-+]?\d+[.,]\d+?.*$)
Yes. This is a fairly complex regex. But if you have varying spaces inside your grouping, you can use .*?\s+ to end on the last space. This seems to work nicely for all the use cases I have.
Thanks for your comments!
I am trying to match the following pattern.
A minimum of 3 'groups' of alphanumeric characters separated by a hyphen.
Eg: ABC1-AB-B5-ABC1
Each group can be any number of characters long.
I have tried the following:
^(\w*(-)){3,}?$
This gives me what I want to an extent.
ABC1-AB-B5-0001 fails, and ABC1-AB-B5-0001- passes.
I don't want the trailing hyphen to be a requirement.
I can't figure out how to modify the expression.
Your ^(\w*(-)){3,}?$ pattern even allows a string like ----- because the only required pattern here is a hyphen: \w* may match 0 word chars. The - may be both leading and trailing because of that.
You may use
\A\w+(?:-\w+){2,}\z
Details:
\A - start of string
\w+ - 1+ word chars (that is, letters, digits or _ symbols)
(?:-\w+){2,} - 2 or more sequences of:
- - a single hyphen
\w+ - 1 or more word chars
\z - the very end of string.
See the regex demo.
Or, if you do not want to allow _:
\A[^\W_]+(?:-[^\W_]+){2,}\z
or to only allow ASCII letters and digits:
\A[A-Za-z0-9]+(?:-[A-Za-z0-9]+){2,}\z
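The difference between the original attempt and the first suggested pattern shows up on a handful of inputs. This is a minimal Python sketch, where \Z is Python's spelling of the "very end of string" anchor that .NET writes as \z; the has_three_groups helper and the inputs beyond those quoted in the question are my own assumptions.

```python
import re

# Pattern from the question: only the hyphen is really required.
ORIGINAL_RE = re.compile(r"^(\w*(-)){3,}?$")

# First pattern from the answer, with \Z in place of .NET's \z.
GROUPS_RE = re.compile(r"\A\w+(?:-\w+){2,}\Z")

def has_three_groups(text: str) -> bool:
    """Return True for 3+ non-empty word groups joined by single hyphens."""
    return GROUPS_RE.search(text) is not None

for candidate in ["ABC1-AB-B5-ABC1", "ABC1-AB-B5-0001", "ABC1-AB-B5-0001-", "-----", "ABC1-AB"]:
    print(repr(candidate), ORIGINAL_RE.search(candidate) is not None, has_three_groups(candidate))
```

Putting the hyphen at the start of the repeated group, after a mandatory first \w+, is what removes the need for a trailing hyphen.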
It can be like this:
^\w+-\w+-\w+(-\w+)*$
^(\w+-){2,}(\w+)-?$
Matches 2+ groups separated by a hyphen, then a single group possibly terminated by a hyphen.
((?:-?\w+){3,})
Matches minimum 3 groups, optionally starting with a hyphen, thus ignoring the trailing hyphen.
Note that the \w word character class also matches the underscore _ as well as digits and letters.
link to demo