C# Regular Expression Help - grouping with optional data points - c#

Given 2 different lines I'm parsing, I need to extract the data points into regex match groups.
Example Line 1:
Header values are as follows:
DATE{space}TYPE{space}DESCR{space}VOLUME{space}RATE{space}TOTAL
[11/30/15] [CF] [DISC 1] [28270.18] [0.00150] [-42.41]
Example Line 2:
DATE{space}TYPE{space}DESCR{space}VOLUME{space}RATE{space}TOTAL
[11/30/15] [CF] [OTHER VOLUME FEES] [28186.68] [0.00008] [-2.25]
I'm using the following regex to get matches:
(?<date>^\d{1,2}[-/.]\d{1,2}[-/.]\d{1,2}[\d+])\s+(?<type>[A-Za-z]{2})\s+(?<descr>\w+\s+.*?(1))\s+.*?(?<volume>(\d+(?:\.\d+?))\s+.*?(?<rate>([0]?(\d+(?:\.\d+)?)))\s+(?<total>[-+]?\d+[.,]\d+)?.*$")
I can match the first case, but never the second. There will always be a total, but there may NOT always be a volume or rate. In addition, volume can be whole, decimal or a code (e.g. "1B").
What am I missing here?
The description field is an open field and may contain "1" in it. I can have several words in it, or just 1.

Your log lines contain 6 fields, but the 4th and 5th can be missing. A common way to match optional fields is to wrap each one in an optional non-capturing group, (?:...)?. These groups do not create a separate capture for the text they match, which keeps the match result cleaner and the matching slightly more efficient.
NOTE that in .NET, you can make all unnamed capturing groups non-capturing with the RegexOptions.ExplicitCapture option.
Your fixed regex may look like
^(?<date>\d{1,2}[-/.]\d{1,2}[-/.]\d{1,2})\s+(?:(?<type>[A-Z]{2})\s+)?(?:(?<descr>\w.*?)\s+)?(?:(?<volume>\d*\.?\d+)\s+)?(?:(?<rate>\d*\.?\d+)\s+)?(?<total>[-+]?\d*[.,]?\d+)\s*$
See the .NET regex demo.
Details
^ - start of the string (or of each line when RegexOptions.Multiline is used)
(?<date>\d{1,2}[-/.]\d{1,2}[-/.]\d{1,2}) - Group "date": 1-2 digits, then 2 repetitions of -, / or . followed by 1-2 digits (thus, this pattern can be written as (?<date>\d{1,2}(?:[-/.]\d{1,2}){2})).
\s+ - 1 or more whitespaces
(?:(?<type>[A-Z]{2})\s+)? - an optional group matching 2 uppercase ASCII letters, captured into Group "type", and then 1+ whitespaces
(?:(?<descr>\w.*?)\s+)? - an optional group matching a word char (a letter, digit or _, plus some other special chars such as diacritics) followed by any 0+ chars other than a newline (LF), as few as possible, all of this captured into Group "descr", and then 1+ whitespaces
(?:(?<volume>\d*\.?\d+)\s+)? - an optional group matching 0+ digits, an optional . and then 1+ digits (that is, floats or integers) captured into Group "volume", then 1+ whitespace chars
(?:(?<rate>\d*\.?\d+)\s+)? - an optional group matching a float or integer values captured into Group "rate", and then 1+ whitespace chars
(?<total>[-+]?\d*[.,]?\d+) - Group "total": an optional - or + followed with 0+ digits, an optional . or , and then 1+ digits (so, positive or negative floats or integers are matched)
\s* - any 0+ trailing whitespaces
$ - end of the line.
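As a rough sketch of how the pattern above could be used from C#, the snippet below reads a named group and reports whether it took part in the match. The class name LogLineParser and the method TryGetField are hypothetical helpers introduced only for illustration, and the sample lines assume the fields are separated by plain spaces, as in the question's header.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
static class LogLineParser
{
    // Pattern from the answer: type, descr, volume and rate sit in optional groups.
    static readonly Regex LineRegex = new Regex(
        @"^(?<date>\d{1,2}[-/.]\d{1,2}[-/.]\d{1,2})\s+" +
        @"(?:(?<type>[A-Z]{2})\s+)?(?:(?<descr>\w.*?)\s+)?" +
        @"(?:(?<volume>\d*\.?\d+)\s+)?(?:(?<rate>\d*\.?\d+)\s+)?" +
        @"(?<total>[-+]?\d*[.,]?\d+)\s*$");

    // Returns true and the group's text when the line matches and the named
    // group participated in the match; otherwise returns false.
    public static bool TryGetField(string line, string name, out string value)
    {
        value = string.Empty;
        Match m = LineRegex.Match(line);
        if (!m.Success) return false;

        Group g = m.Groups[name];
        if (!g.Success) return false; // optional group was skipped

        value = g.Value;
        return true;
    }
}
```

For a line without volume and rate, such as 11/30/15 CF OTHER VOLUME FEES -2.25, TryGetField returns false for "volume" and "rate" while "total" is still captured. Checking Group.Success is what distinguishes a skipped optional group from an empty value.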

(?<date>^\d{1,2}[-/.]\d{1,2}[-/.]\d{1,2}[\d+])\s+(?<type>[A-Z]{2})\s+(?<descr>\w+.*?\s+)(?<volume>\d+[.]?\d+)\s+(?<rate>\d+[.]?\d+)\s+(?<total>[-+]?\d+[.,]\d+?.*$)
Yes. This is a fairly complex regex. But if you have varying spaces inside your grouping, you can use .*?\s+ to end on the last space. This seems to work nicely for all the use cases I have.
Thanks for your comments!

Related

How can I match four number fields or a text string using regex

So a sensor I'm interfacing to either outputs 4 multi-digit integers (separated by spaces) or an error string.
Ideally my regex would return a match for either of the above scenarios and reject any
other outputs - e.g. if only 3 numbers are output. I can then check if there are 4 groups (number output) or 1 group (error string output) in the following c#.
The regex I have matches all the time and returns spaces when there are less than 4 numbers so I still need to check everything.
I've tried putting in ?: but the format breaks. Any regex whizzes up to the challenge? Thanks in advance.
([0-9]+\s)([0-9]+\s)([0-9]+\s)([0-9]+)|([a-zA-Z\s_!]+)
So a valid numeric output would be 11 222 33 4444, and a valid error output would be Sensor is in an error state! An incorrect output would be 222 11 3333, as it only has 3 fields.
Also - I need to capture the four numbers (but not the spaces) or the error string.
You can capture either the 4 groups with only digits and match the whitespace chars outside of the group.
Or else match 1+ times any of the listed characters in the character class. Note that \s can also match a newline, and as the \s is in the character class the match can also consist of only spaces for example.
To match the whole string, you can add anchors.
^(?:([0-9]+)\s([0-9]+)\s([0-9]+)\s([0-9]+)|[a-zA-Z\s_!]+)$
» Regex demo
Another option to match the error string is to match word characters other than digits ([^\W\d]+), optionally followed by repetitions of whitespace chars and more such word characters, with an optional ! at the end.
^(?:([0-9]+)\s([0-9]+)\s([0-9]+)\s([0-9]+)|[^\W\d]+(?:\s+[^\W\d]+)*!?)$
» Regex demo
If there can be a digit in the error message, but you don't want to match only digits or whitespace chars, you can exclude that using a negative lookahead.
^(?:([0-9]+)\s([0-9]+)\s([0-9]+)\s([0-9]+)|(?![\d\s]*$).+)$
» Regex demo
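To illustrate how the first anchored pattern above might be consumed in C#, here is a minimal sketch. SensorOutput and TryParse are hypothetical names chosen for this example. It relies on the fact that an unmatched capture group in .NET reports Success == false, so the calling code can tell a numeric reading from an error string.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
static class SensorOutput
{
    // Either four integers separated by single whitespace chars (captured),
    // or an error string made of letters, whitespace, _ and !.
    static readonly Regex OutputRegex = new Regex(
        @"^(?:([0-9]+)\s([0-9]+)\s([0-9]+)\s([0-9]+)|[a-zA-Z\s_!]+)$");

    // Returns false for unrecognized output. On success, numbers holds the four
    // values for a numeric reading and is empty for an error string.
    public static bool TryParse(string output, out string[] numbers)
    {
        numbers = new string[0];
        Match m = OutputRegex.Match(output);
        if (!m.Success) return false;

        if (m.Groups[1].Success) // the numeric alternative matched
        {
            numbers = new[]
            {
                m.Groups[1].Value, m.Groups[2].Value,
                m.Groups[3].Value, m.Groups[4].Value
            };
        }
        return true;
    }
}
```

With this sketch, 11 222 33 4444 yields four numbers, Sensor is in an error state! yields an empty array, and 222 11 3333 is rejected.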

How to extract middle values in regex

I am trying to extract the fifth and sixth value present in the stream through regex.
The stream is
12,097.00 435.00 100.00 43,037.00 3,090.00 200.00 86.00 45,890.47 7,570.00 51,514.47
I want values 200.00 and 100.00.
I tried ^(?:\S+\s+\n?){3,3} but it's selecting the string from beginning.
Can anybody help me please in getting the values that are present in the middle?
Using a quantifier like {3,3} can be written as {3}, but note that in the example string the values 200.00 and 100.00 are not the 5th and the 6th value.
With your pattern you only get the values at the beginning as the anchor ^ asserts the start of the string.
To get the third and the sixth value, you could also use 2 capture groups by using a quantifier {2} for the parts in between.
^(?:\S+\s+){2}(\S+)(?:\s+\S+){2}\s+(\S+)
^ Start of string
(?:\S+\s+){2} Repeat 2 times matching non whitespace chars followed by whitespace char
(\S+) Capture group 1, match 1+ non whitespace chars
(?:\s+\S+){2}\s+ Repeat 2 times matching whitespace chars and non whitespace chars
(\S+) Capture group 2, match 1+ non whitespace chars
Regex demo
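A short C# sketch of the two-capture-group approach above, assuming the stream is a single whitespace-separated string as in the question. The names MiddleValues and TryGetThirdAndSixth are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
static class MiddleValues
{
    // Skip two values, capture the third, skip two more, capture the sixth.
    static readonly Regex ValuesRegex = new Regex(
        @"^(?:\S+\s+){2}(\S+)(?:\s+\S+){2}\s+(\S+)");

    public static bool TryGetThirdAndSixth(string stream, out string third, out string sixth)
    {
        third = string.Empty;
        sixth = string.Empty;

        Match m = ValuesRegex.Match(stream);
        if (!m.Success) return false; // fewer than six values

        third = m.Groups[1].Value;
        sixth = m.Groups[2].Value;
        return true;
    }
}
```

For the stream in the question, this returns 100.00 and 200.00.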
Certainly, if you have access to the code itself, it would be easier to split the string and get nth chunk by its index.
If you are limited to a regex, you can use
(?<=^(?:\S+\s+){2})\S+
(?<=^(?:\S+\s+){5})\S+
Or, if there can be leading whitespaces:
(?<=^\s*(?:\S+\s+){2})\S+
(?<=^\s*(?:\S+\s+){5})\S+
See a .NET regex demo.
Details:
(?<= - start of a positive lookbehind that requires the following sequence of patterns to appear immediately to the left of the current location:
^ - start of string
\s* - zero or more whitespaces
(?:\S+\s+){2} - two occurrences of 1+ non-whitespace chars followed with 1+ whitespace chars
) - end of the lookbehind
\S+ - one or more non-whitespace chars.
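The sketch below shows both ideas from this answer side by side: the lookbehind pattern, which works in .NET because its regex engine allows quantifiers of unbounded length inside a lookbehind, and the plain split alternative. NthValue, ByLookbehind and BySplit are hypothetical names, and building the pattern from a 1-based index n is an assumption made for illustration. The answer itself hard-codes {2} and {5}.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
static class NthValue
{
    // Returns the n-th (1-based) whitespace-separated value via a lookbehind,
    // or an empty string when there is no such value.
    public static string ByLookbehind(string stream, int n)
    {
        if (n < 1) return string.Empty;

        // Require exactly n - 1 "value + whitespace" chunks to the left.
        string pattern = @"(?<=^\s*(?:\S+\s+){" + (n - 1) + @"})\S+";
        Match m = Regex.Match(stream, pattern);
        return m.Success ? m.Value : string.Empty;
    }

    // Same result without a lookbehind: split on whitespace and index.
    public static string BySplit(string stream, int n)
    {
        string[] parts = Regex.Split(stream.Trim(), @"\s+");
        return (n >= 1 && n <= parts.Length) ? parts[n - 1] : string.Empty;
    }
}
```

Both methods return 100.00 for n = 3 and 200.00 for n = 6 on the stream from the question. The split version is usually easier to read when the code is under your control.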

Regex to extract string between digit pattern and colon or newline

I have to extract the string between a digit pattern and either a colon or a newline (first occurrence).
my string would look like:
05-30-1306-29-13 BUILDERS RISK:
LIMITS/DEDUCTIBLES:
I would like to extract BUILDERS RISK. There may or may not be a colon, in such case we will treat newline as the terminating pattern
Here's what I have come up with so far
\d{2}-\d{2}-\d{4}-\d{2}-\d{2}\s*\W+[^:|\n]+:\s*
The numerical pattern will always be 2-2-4-2-2 digits, followed by any string, followed by either \n or :
The regex so far gets what I need but I don't know how to break it into different matches so I can take the second match
1st match - digit pattern
2nd match - what i need
3rd match - colon or newline
Any pointers will be helpful.
UPDATE: Couple of alternatives of the text term to be searched could be this
11-06-1212-29-12 DWELLING FIRE (DP-3): ANNUAL RENTAL
11-05-1212-26-12 HOMEOWNERS (HO-3): SECONDARY HOME
I would only want anything before colon or if that is not there, take string till newline is found. As a side note, the text of significance may not be present in same line and appear in next line but will always be followed by either a colon or newline in the same line.
PS: Extracted text should not contain colon
It appears you may use
\b(\d{2}-\d{2}-\d{4}-\d{2}-\d{2})\W+(.*?)(:?\r?\n\s*)
See the regex demo.
Details
\b - a word boundary (change to (?<!\d) if the digits can be glued to a letter or underscore)
(\d{2}-\d{2}-\d{4}-\d{2}-\d{2}) - Group 1: two digits, -, two digits, -, four digits, -, two digits, -, two digits
\W+ - 1+ non-word chars (to stay on the line, replace with [^\w\r\n]+)
(.*?) - Group 2: any zero or more chars other than newline, as few as possible
(:?\r?\n\s*) - Group 3: an optional :, an optional CR, an LF symbol and then any 0+ whitespace chars.
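A minimal C# sketch of reading Group 2 from the pattern above. PolicyLine and both method names are hypothetical. Note that with this pattern a colon only ends Group 2 when it sits directly before the line break, so for a line like the ones in the UPDATE, where text follows the colon, Group 2 runs to the end of the line. The second regex is an assumed variant, not taken from the answer, that stops at the first colon or line break instead.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
static class PolicyLine
{
    // Pattern from the answer: Group 2 is the text of interest.
    static readonly Regex AnswerRegex = new Regex(
        @"\b(\d{2}-\d{2}-\d{4}-\d{2}-\d{2})\W+(.*?)(:?\r?\n\s*)");

    // Assumed variant (not from the answer): Group 2 is everything up to
    // the first colon, CR or LF, whichever comes first.
    static readonly Regex FirstColonRegex = new Regex(
        @"\b(\d{2}-\d{2}-\d{4}-\d{2}-\d{2})\W+([^:\r\n]*)");

    public static string Extract(string text)
    {
        Match m = AnswerRegex.Match(text);
        return m.Success ? m.Groups[2].Value : string.Empty;
    }

    public static string ExtractUpToFirstColon(string text)
    {
        Match m = FirstColonRegex.Match(text);
        return m.Success ? m.Groups[2].Value : string.Empty;
    }
}
```

For the first sample in the question, both methods return BUILDERS RISK. For 11-06-1212-29-12 DWELLING FIRE (DP-3): ANNUAL RENTAL followed by a newline, only the variant returns DWELLING FIRE (DP-3).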

C# Regular Expression for x number of groups of A-Z separated by hyphen

I am trying to match the following pattern.
A minimum of 3 'groups' of alphanumeric characters separated by a hyphen.
Eg: ABC1-AB-B5-ABC1
Each group can be any number of characters long.
I have tried the following:
^(\w*(-)){3,}?$
This gives me what I want to an extent.
ABC1-AB-B5-0001 fails, and ABC1-AB-B5-0001- passes.
I don't want the trailing hyphen to be a requirement.
I can't figure out how to modify the expression.
Your ^(\w*(-)){3,}?$ pattern even allows a string like ----- because the only required pattern here is a hyphen: \w* may match 0 word chars. The - may be both leading and trailing because of that.
You may use
\A\w+(?:-\w+){2,}\z
Details:
\A - start of string
\w+ - 1+ word chars (that is, letters, digits or _ symbols)
(?:-\w+){2,} - 2 or more sequences of:
- - a single hyphen
\w+ - 1 or more word chars
\z - the very end of string.
See the regex demo.
Or, if you do not want to allow _:
\A[^\W_]+(?:-[^\W_]+){2,}\z
or to only allow ASCII letters and digits:
\A[A-Za-z0-9]+(?:-[A-Za-z0-9]+){2,}\z
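As a quick check of the patterns above, here is a small C# sketch. HyphenGroups and its method names are hypothetical. It shows the \w-based pattern next to the ASCII-only one, which differ for non-ASCII letters because \w is Unicode-aware in .NET by default.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
static class HyphenGroups
{
    // At least 3 groups of word chars separated by single hyphens.
    static readonly Regex WordCharRegex = new Regex(@"\A\w+(?:-\w+){2,}\z");

    // Same structure, restricted to ASCII letters and digits.
    static readonly Regex AsciiRegex = new Regex(@"\A[A-Za-z0-9]+(?:-[A-Za-z0-9]+){2,}\z");

    public static bool IsValid(string input)
    {
        return WordCharRegex.IsMatch(input);
    }

    public static bool IsValidAscii(string input)
    {
        return AsciiRegex.IsMatch(input);
    }
}
```

Here ABC1-AB-B5-0001 and ABC1-AB-B5 pass, while ABC1-AB-B5-0001-, AB-CD and ----- are rejected by both methods.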
It can be like this:
^\w+-\w+-\w+(-\w+)*$
^(\w+-){2,}(\w+)-?$
Matches 2+ groups separated by a hyphen, then a single group possibly terminated by a hyphen.
((?:-?\w+){3,})
Matches minimum 3 groups, optionally starting with a hyphen, thus ignoring the trailing hyphen.
Note that the \w word character class also matches the underscore char _ as well as digits and letters.
link to demo

C# Regex Match 15 Characters, Single Spaces, Alpha-Numeric

I need to match a string under the following conditions using Regex in C#:
Entire string can only be alphanumeric (including spaces).
Must be a maximum of 15 characters or less (including spaces).
First & last characters can only be a letter.
A single space can appear multiple times in anywhere but the first and last characters of the string. (Multiple spaces together should not be allowed).
Capitalization should be ignored.
Should match the WHOLE word(s).
If any one of these preconditions are broken, a match should not follow.
Here is what i currently have:
^\b([A-z]{1})(([A-z0-9 ])*([A-z]{1}))?\b$
And here are some test strings that should match:
Stack OverFlow
Iamthe greatest
A
superman23s
One Two Three
And some that shouldn't match (note the spaces):
Stack [double_space] Overflow Rocks
23Hello
ThisIsOver15CharactersLong
Hello23
[space_here]hey
etc.
Any suggestions would be much appreciated.
You should use lookaheads; the regex matches only if all of them succeed:
^(?=[a-zA-Z]([a-zA-Z\d\s]*[a-zA-Z])?$)(?=.{1,15}$)(?!.*\s{2,}).*$
(?=[a-zA-Z]([a-zA-Z\d\s]*[a-zA-Z])?$) - checks that the string starts with a letter, optionally followed by 0 or more allowed chars and a final letter (so it neither starts nor ends with a digit or a space)
(?=.{1,15}$) - checks that the string is 1 to 15 chars long
(?!.*\s{2,}) - checks that there are no two or more consecutive whitespace chars
you can try it here
Try this regex: -
"^([a-zA-Z]([ ](?=[a-zA-Z0-9])|[a-zA-Z0-9]){0,13}[a-zA-Z])$"
Explanation : -
[a-zA-Z] // Match first character letter
( // Capture group
[ ](?=[a-zA-Z0-9]) // Match either a `space followed by non-whitespace` (Avoid double space, but accept single whitespace)
| // or
[a-zA-Z0-9] // just `non-whitespace` characters
){0,13} // from `0 to 13` character in length
[a-zA-Z] // Match last character letter
Update : -
To handle single characters, you can make the pattern after the 1st character optional, as nicely pointed out by @Rawling in the comments: -
"^([a-zA-Z](([ ](?=[a-zA-Z0-9])|[a-zA-Z0-9]){0,13}[a-zA-Z])?)$"
Here, everything after the first character is wrapped in a capture group, and that group is made optional with ?.
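To check the updated pattern against the test strings from the question, a minimal C# sketch could look like this. NameValidator and IsValid are hypothetical names. No RegexOptions.IgnoreCase is needed here because the character classes already list both cases.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
static class NameValidator
{
    // First char: letter. Then, optionally, up to 13 alphanumerics or single
    // spaces (a space must be followed by an alphanumeric) and a final letter.
    static readonly Regex NameRegex = new Regex(
        @"^([a-zA-Z](([ ](?=[a-zA-Z0-9])|[a-zA-Z0-9]){0,13}[a-zA-Z])?)$");

    public static bool IsValid(string input)
    {
        return NameRegex.IsMatch(input);
    }
}
```

Under this sketch, Stack OverFlow, Iamthe greatest, A, superman23s and One Two Three are accepted, while 23Hello, Hello23, ThisIsOver15CharactersLong, a string with a leading space and a string with a double space are rejected. One caveat: in .NET, $ also matches just before a final newline, so replace it with \z if a trailing line break must be rejected too.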
And my version, again using look-aheads:
^(?=.{1,15}$)(?=^[A-Z].*)(?=.*[A-Z]$)(?!.*[ ]{2})[A-Z0-9 ]+$
explained:
^ start of string
(?=.{1,15}$) positive look-ahead: must be between 1 and 15 chars
(?=^[A-Z].*) positive look-ahead: initial char must be alpha
(?=.*[A-Z]$) positive look-ahead: last char must be alpha
(?!.*[ ]{2}) negative look-ahead: string mustn't contain 2 or more consecutive spaces anywhere
[A-Z0-9 ]+ if all the look-aheads agree, select only alpha-numeric chars + space
$ end of string
This will also need the IgnoreCase option setting
