Splitting string into groups with Regular Expression - c#

I need a regular expression that allows me to split the following string in C#:
$1.89 BROWN RICE ‐ 16 03/01 ‐ 03/07 1.29
into something like this:
Group 1: BROWN RICE - 16
Group 2: 03/01 ‐ 03/07
Group 3: 1.29
Is it possible to achieve this with Regex?

In your case I think a regex will be better than splitting.
If the format is original price, (product - quantity), (date range), (sale price), you can try something like:
\$?\d+\.\d{2} ([A-Za-z ]+- *\d+) +(\d{2}/\d{2} *- *\d{2}/\d{2}) +\$?(\d+\.\d{2})
Title & Quantity are in captured group 1, date range in group 2, new price in group 3.
Explanation:
\$?\d+\.\d{2}: a price with an optional dollar sign and exactly two decimal places (for the cents). If you want to allow '$1' (i.e. no decimal places), modify accordingly.
([A-Za-z ]+- *\d+): the product name and quantity, separated by a hyphen. You may wish to modify this part depending on the names you expect (perhaps they don't consist only of letters and spaces).
(\d{2}/\d{2} *- *\d{2}/\d{2}): the date range. I have no idea if yours is month/day or day/month, but depending on that you can make your regex a lot more restrictive if you wish (for example, a day of the month could be (0[1-9]|[12]\d|3[01]) and a month (0[1-9]|1[0-2])).
\$?(\d+\.\d{2}): the sale price.

Have you tried having a go with something like regexpal? It makes it easy to test out how to filter the data you're interested in. There's a bunch of hints in the top right which basically describe how to write regular expressions too...
First we want to match the price, but we don't care about it, so the ?: makes that group non-capturing:
(?:\$\d+\.\d+)
Since we know what the third part should look like, our first section of interest can gobble up anything in the middle:
(.*)
Next we want to match that date range:
(\d{2}/\d{2} ‐ \d{2}/\d{2})
And finally we want a floating point number:
(\d+\.\d+)
So in conclusion, something like this should work:
(?:\$\d+\.\d+) (.*) (\d{2}/\d{2} ‐ \d{2}/\d{2}) (\d+\.\d+)
You'll need to escape the backslashes to include that in C# (or use a verbatim string, @"...").
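As a rough sketch of the escaping point, the same pattern can be written as a regular C# string literal with doubled backslashes or as a verbatim literal. The class and constant names below are made up for illustration, and the hyphen in the sample line is assumed to be U+2010 (HYPHEN), written as the regex escape \u2010 so that it stays visible.

```csharp
using System;
using System.Text.RegularExpressions;

public static class VerbatimPatternDemo
{
    // Regular string literal: every regex backslash has to be doubled.
    public const string Escaped =
        "(?:\\$\\d+\\.\\d+) (.*) (\\d{2}/\\d{2} \\u2010 \\d{2}/\\d{2}) (\\d+\\.\\d+)";

    // Verbatim string literal: backslashes reach the regex engine unchanged.
    public const string Verbatim =
        @"(?:\$\d+\.\d+) (.*) (\d{2}/\d{2} \u2010 \d{2}/\d{2}) (\d+\.\d+)";

    public static void Main()
    {
        // Sample line from the question; \u2010 here is a C# character escape.
        string input = "$1.89 BROWN RICE \u2010 16 03/01 \u2010 03/07 1.29";

        // Both literals produce exactly the same pattern text.
        Console.WriteLine(Escaped == Verbatim);

        Match m = Regex.Match(input, Verbatim);
        if (m.Success)
        {
            Console.WriteLine("Group 1: " + m.Groups[1].Value); // product and quantity
            Console.WriteLine("Group 2: " + m.Groups[2].Value); // date range
            Console.WriteLine("Group 3: " + m.Groups[3].Value); // sale price
        }
    }
}
```

The verbatim form is usually easier to compare against a pattern that was tested in an online tool, since it can be pasted in without changes.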

(\$\d\.\d{2}) (.*?) (\d{2}/\d{2} - \d{2}/\d{2}) (.*)
This works on your example. It may need to be improved if you have any more data variations
(\$\d\.\d{2}) - matches the price, e.g. $0.00. If prices can be more than $9.99, make the digit match one or more: (\$\d+\.\d{2})
(.*?) - lazily matches everything up to the next group
(\d{2}/\d{2} - \d{2}/\d{2}) - matches the date range
(.*) - matches whatever is left
You may also wish to put start and end line constraints if you're reading a bunch of these from a text file.

/^\$\d*\.\d{2,}\s([^-]+\s[-]\s\d+)\s(\d{2}\/\d{2}\s[-]\s\d{2}\/\d{2})\s(\d*\.\d{2,})$/
Group 1 : BROWN RICE - 16
Group 2 : 03/01 ‐ 03/07
Group 3 : 1.29 (will also match 0.29 and .29)

Try
(\$\d+\.\d+)\s(.*?)\s(\d{2}/\d{2}\s-\s\d{2}/\d{2})\s(\d+\.\d+)
(\$\d+\.\d+) matches the price in dollars
(.*?) matches the product name
(\d{2}/\d{2}\s-\s\d{2}/\d{2}) matches the date range
(\d+\.\d+) matches the second price
I noticed that the minus sign (-) in your example uses a different character code than the standard hyphen-minus. Because of that, my regex did not match until I replaced your "‐" characters with normal ones.
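One way to deal with the hyphen issue, sketched under the assumption that the odd character is U+2010 (HYPHEN), is to normalize the line before matching. The class and method names here are hypothetical; the pattern is the one from this answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HyphenNormalizer
{
    // Pattern from the answer above; it expects ASCII hyphen-minus characters.
    private static readonly Regex Line = new Regex(
        @"(\$\d+\.\d+)\s(.*?)\s(\d{2}/\d{2}\s-\s\d{2}/\d{2})\s(\d+\.\d+)");

    // Replaces U+2010 with '-' and then applies the pattern.
    public static Match Parse(string line)
    {
        string normalized = line.Replace('\u2010', '-');
        return Line.Match(normalized);
    }

    public static void Main()
    {
        Match m = Parse("$1.89 BROWN RICE \u2010 16 03/01 \u2010 03/07 1.29");
        if (m.Success)
        {
            // Groups 1 to 4: original price, product, date range, sale price.
            for (int i = 1; i <= 4; i++)
                Console.WriteLine("Group " + i + ": " + m.Groups[i].Value);
        }
    }
}
```

If other dash variants can appear in the data (en dash, minus sign), they would need the same treatment; that is not covered here.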

Related

How to validate Regex

I'm having a hard time with grouping parts of a regex. I want to validate a few things in a string that follows this format: I-XXXXXX.XX.XX.XX
Validate that the first set of 6 X's (I-xxxxxx.XX.XX.XX) does not contain characters and its length is no more than 6.
Validate that the third set of X's (I-XXXXXX.XX.xx.XX) does not contain characters and is only 1 or 2.
Now, I have already validation on the last set of XX's to make sure the numbers are 1-8 using
string pattern1 = @"^.+\.(0?[1-8])$";
Match match = Regex.Match(TxtWBS.Text, pattern1);
if (!match.Success)
{
    errMessage += "WBS invalid";
    errMessage += Environment.NewLine;
}
I just can't figure out how to target specific parts of the string. Any help would be greatly appreciated, and thank you in advance!
You're having some trouble adding new validation to this string because it's very generic. Let's take a look at what you're doing:
^.+\.(0?[1-8])$
This finds the following:
^ the start of the string
.+ everything it can, other than a newline, basically jumping the engine's cursor to the end of your line
\. the last period in the string, because of the greedy quantifier in the .+ that comes before it
0? a zero, if it can
[1-8] a digit between 1 and 8
()$ stores the two previous things in a group, and if the end of the string doesn't come after this, it may even backtrack and try the same thing from the second-to-last period instead, which we know isn't a great strategy.
This ends up matching a lot of weird stuff, for example the string "The number 0.1".
Let's try patterning something more specific, if we can:
^I-(\d{6})\.(\d{2})\.(\d{1,2})\.([1-8]{2})$
This will match:
^I- an I and a hyphen at the start of the string
(\d{6}) six digits, which it stores in a capture group
\. a period. By now, if there was any other number of digits than six, the match fails instead of trying to backtrack all over the place.
(\d{2})\. Same thing, but two digits instead of six.
(\d{1,2})\. Same thing, the comma here meaning it can match between one and two digits.
([1-8]{2}) Two digits that are each between 1 and 8.
$ The end of the string.
I hope I understood what exactly you're trying to match here. Let me know if this isn't what you had in mind.
This regex:
^.-[0-9]{6}(\.[1-8]{1,2}){3}$
will validate the following:
The first character can be any character, but is of length 1
It is followed by a dash
The dash is followed by exactly 6 numbers 0 - 9. (If this could be less than 6 characters - for example, between 3 and 6 characters - just replace {6} with {3,6}).
This is followed by 3 groups of characters. Each of these groups is preceded by a period, is of length 1 or 2, and consists only of digits 1 - 8.
An example of a valid string is:
I-587954.12.34.56
This is also valid:
I-587954.1.3.5
But this isn't:
I-587954.12.80.356
because the second-to-last group contains a 0, and because the last group is of length 3.
Please let me know if I have misunderstood any of the rules.
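The three example strings above can be checked with a small helper. This is only a sketch; the class and method names are invented for illustration, and the pattern is copied from this answer unchanged.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WbsValidator
{
    // Any one character, a dash, six digits, then three ".N" or ".NN" groups of digits 1-8.
    private static readonly Regex Wbs = new Regex(@"^.-[0-9]{6}(\.[1-8]{1,2}){3}$");

    public static bool IsValid(string value)
    {
        return Wbs.IsMatch(value);
    }

    public static void Main()
    {
        string[] samples = { "I-587954.12.34.56", "I-587954.1.3.5", "I-587954.12.80.356" };
        foreach (string s in samples)
            Console.WriteLine(s + ": " + IsValid(s));
    }
}
```

The first two samples are accepted and the third is rejected, matching the explanation above.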
^I-([0-9]{1,6})\.(.{1,2})\.(0[1-2])\.(.{1,2})$
The groups are delimited by a period (\.):
([0-9]{1,6}) - 1 to 6 digits
(.{1,2}) - 1 or 2 characters of any kind
(0[1-2]) - 01 or 02
(.{1,2}) - 1 or 2 characters of any kind
You can easily write and test a regex against your input data; just search for "regex online".

How to mask first 6 and last 4 digits for a credit card number in .net

I'm very new to regex, and I'm trying to use a regular expression to turn a credit card number, which will be part of a conversation, into something like 492900******2222
As it can come from any conversation, it might have text right next to it or might have an inconsistent format, so essentially all of the below should be formatted like the example above:
hello my number is492900001111222
number is 4929000011112222ok?
4929 0000 1111 2222
4929-0000-1111-2222
It needs to be a regular expression that extracts a capture group, on which I can then use a MatchEvaluator to turn every digit (ignoring non-digits) that is not among the first 6 or last 4 into a *
I've seen many examples here on Stack Overflow for PHP and JS, but none that help me resolve this issue.
Any guidance will be appreciated.
UPDATE
I need to expand upon an existing implementation which uses a MatchEvaluator to mask each character that is not among the first 6 or last 4. Ideally I don't want to change the MatchEvaluator and would rather make the masking flexible based on the regular expression; see this for an example: https://dotnetfiddle.net/J2LCo0
UPDATE 2
The answers from @Matt.G and @CAustin do resolve what I asked for, but I am hitting another barrier where I can't have it be so strict. The final captured group needs to take only the digits into account and, as such, maintain the format of the input text.
So for example:
If someone types my card number is 99 9988 8877776666, the output from the evaluation should be 99 9988 ******666666
OR
my card number is 9999-8888-7777-6666 it should output 9999-88**-****-6666.
Is this possible?
Changed the list to include items that are in my unit tests https://dotnetfiddle.net/tU6mxQ
Try Regex: (?<=\d{4}\d{2})\d{2}\d{4}(?=\d{4})|(?<=\d{4}( |-)\d{2})\d{2}\1\d{4}(?=\1\d{4})
Explanation:
The pattern consists of two alternatives:
(?<=\d{4}\d{2})\d{2}\d{4}(?=\d{4}) - handles card numbers without any separators (- or space)
(?<=\d{4}( |-)\d{2})\d{2}\1\d{4}(?=\1\d{4}) - handles card numbers with a separator (- or space)
1st alternative: (?<=\d{4}\d{2})\d{2}\d{4}(?=\d{4})
Positive lookbehind (?<=\d{4}\d{2}) - asserts that 6 digits immediately precede the match
\d{2} - matches exactly 2 digits (each equal to [0-9])
\d{4} - matches exactly 4 digits (each equal to [0-9])
Positive lookahead (?=\d{4}) - asserts that the match is immediately followed by 4 digits
2nd alternative: (?<=\d{4}( |-)\d{2})\d{2}\1\d{4}(?=\1\d{4})
Positive lookbehind (?<=\d{4}( |-)\d{2}) - asserts that 4 digits, a separator and 2 digits immediately precede the match
1st capturing group ( |-) - captures the separator, so that the next occurrence of the separator can be checked using \1
\1 - matches the same text as most recently matched by the 1st capturing group (the separator, in this case)
Positive lookahead (?=\1\d{4}) - asserts that the match is followed by the separator and 4 digits
If performance is a concern, here's a pattern that only goes through 94 steps, instead of the other answer's 473, by avoiding lookaround and alternation:
\d{4}[ -]?\d{2}\K\d{2}[ -]?\d{4}
Demo: https://regex101.com/r/0XMluq/4
Edit: \K is not available in C#'s regex flavor, but since .NET allows variable-length lookbehind, the following pattern can be used instead:
(?<=\d{4}[ -]?\d{2})\d{2}[ -]?\d{4}
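A minimal sketch of how the lookbehind pattern might be combined with a MatchEvaluator: the evaluator below replaces only the digits inside each match, so separators are preserved. The class and method names are hypothetical, and this does not attempt to handle the irregular spacing described in UPDATE 2.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CardMasker
{
    // Matches the middle six digits (with an optional separator between them)
    // when they are preceded by the first six digits of the card number.
    private static readonly Regex Middle =
        new Regex(@"(?<=\d{4}[ -]?\d{2})\d{2}[ -]?\d{4}");

    public static string Mask(string text)
    {
        // The lambda is converted to a MatchEvaluator; it masks digits only.
        return Middle.Replace(text, m => Regex.Replace(m.Value, @"\d", "*"));
    }

    public static void Main()
    {
        Console.WriteLine(Mask("number is 4929000011112222ok?"));
        Console.WriteLine(Mask("4929 0000 1111 2222"));
        Console.WriteLine(Mask("4929-0000-1111-2222"));
    }
}
```

Because the pattern does not check what follows the masked part, text that merely contains twelve or more digits in this shape would also be masked; tightening that is left to the caller.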

Regex Parsing Beginning of Line

I have a string and I would like to parse it using regular expression. .. indicates the category name and everything after : is the content for that category.
Below is the full string I'm trying to parse:
..NAME: JOHN
..BDAY: 1/1/2010
..NOTE: 1. some note 1
2. some note 2
3. some note 3
..DATE: 6/3/2014
I'm trying to parse it so that
(group 1)
..NAME: JOHN
(group 2)
..BDAY: 1/1/2010
(group 3)
..NOTE: 1. some note 1
2. some note 2
3. some note 3
(group 4)
..DATE: 6/3/2014 //a.k.a update date
The regular expression pattern I use is
\.\.[A-Z0-9]{2,4}:.*
which leaves (group 3) with only ..NOTE: 1. some note 1, missing the content on the second and third lines.
How can I modify my pattern so I can get the correct grouping?
. matches every character except newline (in most languages; Ruby is one exception). Use RegexOptions.Singleline in C# (or the s modifier in PCRE) to make it match newlines as well.
You will also need to make your .* lazy, so that it stops at the next .. or at the end of the string ($) instead of matching everything the first time. Also, . doesn't have any special meaning inside a character class, so your expression may end up looking cleaner like this:
[.]{2}[A-Z0-9]{2,4}:.*?(?=[.]{2}|$)
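The lazy pattern with RegexOptions.Singleline could be used roughly as follows. The class and method names are assumptions made for this sketch, and trailing newlines are trimmed from each section because the lazy match stops right before the next "..".

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SectionParser
{
    public static List<string> Split(string text)
    {
        var sections = new List<string>();
        // Singleline lets '.' match '\n', so a section can span several lines.
        MatchCollection matches = Regex.Matches(
            text,
            @"[.]{2}[A-Z0-9]{2,4}:.*?(?=[.]{2}|$)",
            RegexOptions.Singleline);

        foreach (Match m in matches)
            sections.Add(m.Value.TrimEnd());
        return sections;
    }

    public static void Main()
    {
        string text =
            "..NAME: JOHN\n..BDAY: 1/1/2010\n..NOTE: 1. some note 1\n" +
            "2. some note 2\n3. some note 3\n..DATE: 6/3/2014";

        foreach (string section in Split(text))
            Console.WriteLine("[" + section + "]");
    }
}
```

Note that a section whose content itself contains ".." would be cut short by the lookahead.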
I managed to achieve it with the negative lookahead for [.]{2}:
[.]{2}[A-Z0-9]{2,4}:(.*\n?(?![.]{2}))*

Need C# Regex to match a four digit sequence, but ignore any single digits preceding

OK, I need to improve this question. Let me try this again:
I need to parse out a flight time which comes after an airport code, but may have a single digit and white space between the two.
Example data:
ORD 1100
HOU 1 1215
MAD 4 1300
I tried this:
([A-Z]{3})\s?\d?\s?(\d{4})
I end up with the airport code and a single digit.
I need a regex that will ignore everything after the airport code except the 4 digit flight time.
Hope I improved my question.
The solution might be as simple as:
\d{4}
According to your inputs, you don't need to care about the preceding digits.
This is the answer I would use:
@"([A-Z]{3})\s+(?:[0-9]\s+)?([0-9]{4})"
Basically it is very similar to what you were attempting to do.
The first part is ([A-Z]{3}), which looks for 3 uppercase letters and assigns them to group 1 (group 0 is the entire match).
The second part is \s+(?:[0-9]\s+)?, which requires at least one whitespace character, optionally followed by a single digit. The non-capturing group requires that if there is a single digit there, it must be followed by at least one whitespace character. This prevents a mismatch for something like ABC 12345.
Next we have ([0-9]{4}), which simply matches the 4 digits you are looking for. These can be found in group 2. I use [0-9] here since \d matches more digits than we are used to (such as Eastern Arabic numerals).
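Applied to the example data from the question, the pattern can be exercised like this. It is a sketch only; the class name and the TryParse signature are invented for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FlightTimeParser
{
    private static readonly Regex Flight =
        new Regex(@"([A-Z]{3})\s+(?:[0-9]\s+)?([0-9]{4})");

    // Returns true and fills in the airport code and time when the line matches.
    public static bool TryParse(string line, out string airport, out string time)
    {
        Match m = Flight.Match(line);
        airport = m.Success ? m.Groups[1].Value : "";
        time = m.Success ? m.Groups[2].Value : "";
        return m.Success;
    }

    public static void Main()
    {
        foreach (string line in new[] { "ORD 1100", "HOU 1 1215", "MAD 4 1300" })
        {
            if (TryParse(line, out string airport, out string time))
                Console.WriteLine(airport + " -> " + time);
        }
    }
}
```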
Here's a little something, using lookbehind and lookahead to be sure there are only 4 digits, with non-digits (or beginning/end) surrounding them.
"(?<=[^\d]|^)\d{4}(?=[^\d]|$)"
The two [^\d] can be replaced with [\s] to only match 4-digits with whitespace around them.
Update:
With your latest update, I merged my regex with yours (from the comment) and came up with this:
"(?<=[A-Z]{3}\s(\d\s)?)\d{4}(?=\s|$)"
There are three parts to the pattern. First is the lookbehind: (?<=PatternHere). The pattern inside this must occur/match before what we seek.
The next part is our simple main pattern: \d{4}, four digits.
The last part is the lookahead: (?=PatternHere), which is pretty much the same as lookbehind, but checks the other side, forward.
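For comparison, here is a small sketch of the lookaround version, where the match value itself is the flight time and no groups need to be read. The class and method names are hypothetical; the pattern relies on .NET's support for variable-length lookbehind.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FlightTimeLookaround
{
    private static readonly Regex Time =
        new Regex(@"(?<=[A-Z]{3}\s(\d\s)?)\d{4}(?=\s|$)");

    // Returns true and the four-digit time when one is found after an airport code.
    public static bool TryGetTime(string line, out string time)
    {
        Match m = Time.Match(line);
        time = m.Success ? m.Value : "";
        return m.Success;
    }

    public static void Main()
    {
        foreach (string line in new[] { "ORD 1100", "HOU 1 1215", "ABC 12345" })
        {
            bool found = TryGetTime(line, out string time);
            Console.WriteLine(line + " -> " + (found ? time : "(no match)"));
        }
    }
}
```

With this pattern, "ABC 12345" produces no match at all, because neither 1234 nor 2345 satisfies both the lookbehind and the lookahead.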

Regex to match on capital letter, digit or capital, lowercase, and digit

I'm working on an application which will calculate molecular weight and I need to separate a string into the different molecules. I've been using a regex to do this but I haven't quite gotten it to work.
I need the regex to match on patterns like H2OCl4 and Na2H2O where it would break it up into matches like:
H2
O
Cl4
Na2
H2
O
The regex I've been working on is this:
([A-Z]\d*|[A-Z]*[a-z]\d*)
It's really close but it currently breaks the matches into this:
H2
O
C
l4
I need the Cl4 to be considered one match. Can anyone help me with the last part I'm missing in this? I'm pretty new to regular expressions. Thanks.
I think what you want is "[A-Z][a-z]?\d*"
That is, a capital letter, followed by an optional small letter, followed by an optional string of digits.
If you want to match 0, 1, or 2 lower-case letters, then you can write:
"[A-Z][a-z]{0,2}\d*"
Note, however, that both of these regular expressions assume that the input data is valid. Given bad data, they will silently skip over it. For example, if the input string is "H2ClxxzSO4", the second pattern gives you:
H2
Clx
S
O4
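If you want to detect bad data, you'll need to check the Index property of each returned Match object to ensure that it is equal to the index where the previous match ended (0 for the first match), and that the last match ends at the end of the string.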
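The Index check might look something like the sketch below, which uses the first pattern ([A-Z][a-z]?\d*) and throws as soon as any input is skipped. The class name, method name and the choice of FormatException are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FormulaParser
{
    private static readonly Regex Token = new Regex(@"[A-Z][a-z]?\d*");

    // Splits a formula into tokens; throws if any character is not covered by a match.
    public static List<string> Split(string formula)
    {
        var tokens = new List<string>();
        int expected = 0; // index where the next match has to start

        foreach (Match m in Token.Matches(formula))
        {
            if (m.Index != expected)
                throw new FormatException("Unexpected text at index " + expected);
            tokens.Add(m.Value);
            expected = m.Index + m.Length;
        }

        if (expected != formula.Length)
            throw new FormatException("Unexpected text at index " + expected);
        return tokens;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(" ", Split("H2OCl4")));
        Console.WriteLine(string.Join(" ", Split("Na2H2O")));
    }
}
```

For the bad input "H2ClxxzSO4", the match after Cl starts at S rather than directly after Cl, so the check fails and the exception is thrown.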
Note that if you expect international characters in your input, such as letters with diacritic marks (ñ, é, è, ê, ë, etc.), then you should use the corresponding Unicode categories. In your case, what you want is @"\p{Lu}\p{Ll}?\d*".
