RegEx pattern to capture invoice line items containing unit prices in description - c#

Using C#, I am attempting to extract individual invoice line items from a block of text containing ALL the line items. For each line item, I want to separate and capture the Line Item Code, Line Item Description, and Line Item Dollar Amount. The issue is that many of the line item descriptions include decimal amounts that look like dollar amounts, so the regex I am using captures several entire line items into one line item description. How can I alter my regex to include these decimal numbers in the description, while still separating prices into another match group? I am also open to other optimization suggestions.
Here is the block of line items that is giving me trouble:
1244 Drayage Charge MEDU2265085
1,875.00
4083 Chassis MEDU2265085 TRIAXLE 4 DAYS
640.00
1268 Pre-Pull MEDU2265085
250.00
1248 Truck Waiting & Over Time MEDU2265085 3.5*120
420.00
1244 Drayage Charge MEDU3325790
1,875.00
4083 Chassis MEDU3325790 TRIAXLE 4 DAYS
640.00
1268 Pre-Pull MEDU3325790
250.00
1248 Truck Waiting & Over Time MEDU3325790 2.38*120
285.60
1244 Drayage Charge MSCU3870551
1,875.00
4083 Chassis MSCU3870551 TRIAXLE 4 DAYS
640.00
1268 Pre-Pull MSCU3870551
250.00
1248 Truck Waiting & Over Time MSCU3870551 3.5*120
420.00
And here is my best attempt at a regex pattern:
(?<LINE_ITEM_CODE>[0-9]{4})[\r\s\n](?<LINE_ITEM_DESCRIPTION>[A-Za-z0-9\r\s\n\-\%\&\*\.]*)[\r\n\s](?<LINE_ITEM_AMOUNT>[0-9\,]{1,7}.[0-9]{2})
If you punch these in over at regexr.com or regexstorm.net, you'll see that several of the line items are being captured as a single line item description. The alternative I had been using previously did not accommodate the 3.5, 2.38 etc. How can I target the prices while still grouping the other decimals into the description?
I'm open to alternative solutions

You can use
(?m)^(?<LINE_ITEM_CODE>\d{4})\s+(?<LINE_ITEM_DESCRIPTION>.*?)\r?\n(?<LINE_ITEM_AMOUNT>\d{1,3}(?:,\d{3})*\.\d{2})
Details:
(?m)^ - a multiline flag that makes ^ match start of a line
(?<LINE_ITEM_CODE>\d{4}) - Group "LINE_ITEM_CODE": four digits
\s+ - one or more whitespaces (including newlines)
(?<LINE_ITEM_DESCRIPTION>.*?) - Group "LINE_ITEM_DESCRIPTION": any zero or more chars other than newline chars as few as possible
\r?\n - CRLF or LF
(?<LINE_ITEM_AMOUNT>\d{1,3}(?:,\d{3})*\.\d{2}) - Group "LINE_ITEM_AMOUNT": one to three digits and then zero or more repetitions of a comma and three digits and then a dot and two digits.
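As a rough sketch of how the pattern above might be applied in C#, assuming the text has already been loaded into a string: the class name InvoiceParser, the method name Parse, and the tuple shape are illustrative choices, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class InvoiceParser
{
    // Pattern taken from the answer above; everything else here is illustrative.
    private static readonly Regex LineItem = new Regex(
        @"(?m)^(?<LINE_ITEM_CODE>\d{4})\s+(?<LINE_ITEM_DESCRIPTION>.*?)\r?\n(?<LINE_ITEM_AMOUNT>\d{1,3}(?:,\d{3})*\.\d{2})");

    public static List<(string Code, string Description, decimal Amount)> Parse(string text)
    {
        var items = new List<(string Code, string Description, decimal Amount)>();
        foreach (Match m in LineItem.Matches(text))
        {
            // NumberStyles.Number accepts the thousands separator in "1,875.00".
            decimal amount = decimal.Parse(
                m.Groups["LINE_ITEM_AMOUNT"].Value,
                NumberStyles.Number,
                CultureInfo.InvariantCulture);
            items.Add((m.Groups["LINE_ITEM_CODE"].Value,
                       m.Groups["LINE_ITEM_DESCRIPTION"].Value,
                       amount));
        }
        return items;
    }

    public static void Main()
    {
        string text =
            "1244 Drayage Charge MEDU2265085\n1,875.00\n" +
            "1248 Truck Waiting & Over Time MEDU2265085 3.5*120\n420.00";
        foreach (var item in Parse(text))
            Console.WriteLine($"{item.Code} | {item.Description} | {item.Amount}");
    }
}
```

Because the amount must sit on its own line directly after the description, the 3.5*120 in the second item stays in the description instead of being mistaken for a price.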

Related

Matching a sequence of characters split by spaces after a prefix

I have the following strings:
-prefix <#141222969505480701> where the second part e.g. <#141222969505480701> can be repeated unlimited times (only the numbers change).
-prefix 141222969505480701 which should behave the same as above.
-prefix 141222969505480701 <#141222969505480702> which would still be able to repeat itself forever.
The last one should have groups containing 141222969505480701 and 141222969505480702.
So a few bits of information:
The digit chains are always 18 in total so I use \d{18} in my regex
I would like to have the numbers in groups for me to use them afterwards.
What I tried
First off, I tried to match the first of my example strings.
-prefix(\s<#\d{18}>)\1* which would match the entire string, but I would like to have the digits itself in its own group. Also this method only matches the same parts e.g. <#141222969505480701> <#141222969505480701> <#141222969505480701> would match, but any other number in between wouldn't match.
What would sound logical in my head
-prefix (\d{18})+ but it would only match the first one of the 'digit parts'.
While I was testing it on regex101 it told me the following:
A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data.
I tried to adjust the regex to the following -prefix ((\d{18})+), but with the same result.
With the help of @madreflection in the comments, I was able to come up with this solution:
-prefix([\s]*(<#|)(?<digits>[0-9]{18})>?)+
Which is exactly what I needed, which even ignores spaces in between. Also with the use of match.Groups["digits"].Captures it made the whole story a lot easier.
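For readers who have not used Captures before, here is a minimal sketch of reading every iteration of the repeated digits group. It uses the pattern quoted above; the names PrefixParser and ExtractIds are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PrefixParser
{
    // Pattern from the post above; the class and method names are illustrative.
    private static readonly Regex Pattern =
        new Regex(@"-prefix([\s]*(<#|)(?<digits>[0-9]{18})>?)+");

    public static List<string> ExtractIds(string input)
    {
        var ids = new List<string>();
        Match m = Pattern.Match(input);
        if (!m.Success)
            return ids;

        // Groups["digits"].Value holds only the last iteration of the group;
        // Captures keeps every iteration, in order.
        foreach (Capture c in m.Groups["digits"].Captures)
            ids.Add(c.Value);
        return ids;
    }

    public static void Main()
    {
        var ids = ExtractIds("-prefix 141222969505480701 <#141222969505480702>");
        Console.WriteLine(string.Join(", ", ids));
        // prints "141222969505480701, 141222969505480702"
    }
}
```

This is a .NET-specific feature: most other regex flavors keep only the last iteration, which is what the regex101 warning quoted earlier describes.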
You could use an alternation to list the 3 different allowed formats. In .NET it is supported to reuse the group name.
-prefix\s*(?:(?<digits>[0-9]{18})\s*<#(?<digits>[0-9]{18})>|(?<digits>[0-9]{18})|<#(?<digits>[0-9]{18}))
Pattern parts
-prefix\s* Match literally followed by 0+ whitespace characters
(?: Non capturing group
(?<digits>[0-9]{18})\s*<#(?<digits>[0-9]{18})> 2 named capturing groups which will match the digits
| Or
(?<digits>[0-9]{18}) Named capturing group, match digits only
| Or
<#(?<digits>[0-9]{18}) Named capturing group, match digits between brackets only
)
You could also use 2 named capturing groups, 1 for each format. For example:
-prefix\s*(?:(?<digits>[0-9]{18})\s*<#(?<digitsBrackets>[0-9]{18})>|(?<digits>[0-9]{18})|<#(?<digitsBrackets>[0-9]{18}))

How to validate Regex

I'm having a hard time with grouping parts of a regex. I want to validate a few things in a string that follows this format: I-XXXXXX.XX.XX.XX
Validate that the first set of 6 X's (I-xxxxxx.XX.XX.XX) does not contain characters and its length is no more than 6.
Validate that the third set of X's (I-XXXXXX.XX.xx.XX) does not contain characters and is only 1 or 2.
Now, I have already validation on the last set of XX's to make sure the numbers are 1-8 using
string pattern1 = @"^.+\.(0?[1-8])$";
Match match = Regex.Match(TxtWBS.Text, pattern1);
if (!match.Success)
{
    errMessage += "WBS invalid";
    errMessage += Environment.NewLine;
}
I just can't figure out how to target specific parts of the string. Any help would be greatly appreciated and thank you in advance!
You're having some trouble adding new validation to this string because it's very generic. Let's take a look at what you're doing:
^.+\.(0?[1-8])$
This finds the following:
^ the start of the string
.+ everything it can, other than a newline, basically jumping the engine's cursor to the end of your line
\. the last period in the string, because of the greedy quantifier in the .+ that comes before it
0? a zero, if it can
[1-8] a number between 1 and 8
()$ stores the two previous things in a group and requires the end of the string to follow. If the end of the string doesn't come after this, the engine backtracks and tries the same thing from the second-to-last period instead, which we know isn't a great strategy.
This ends up matching a lot of weird stuff, for example the string "The number 0.1".
Let's try patterning something more specific, if we can:
^I-(\d{6})\.(\d{2})\.(\d{1,2})\.([1-8]{2})$
This will match:
^I- an I and a hyphen at the start of the string
(\d{6}) six digits, which it stores in a capture group
\. a period. By now, if there was any other number of digits than six, the match fails instead of trying to backtrack all over the place.
(\d{2})\. Same thing, but two digits instead of six.
(\d{1,2})\. Same thing, the comma here meaning it can match between one and two digits.
([1-8]{2}) Two digits that are each between 1 and 8.
$ The end of the string.
I hope I understood what exactly you're trying to match here. Let me know if this isn't what you had in mind.
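To make that concrete, here is a small sketch of how the stricter pattern could replace the original check. WbsValidator and IsValid are hypothetical names; in the question's code the input would come from TxtWBS.Text.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WbsValidator
{
    // Pattern from the answer above; the class and method names are illustrative.
    private static readonly Regex Wbs =
        new Regex(@"^I-(\d{6})\.(\d{2})\.(\d{1,2})\.([1-8]{2})$");

    public static bool IsValid(string input) => Wbs.IsMatch(input);

    public static void Main()
    {
        string[] samples = { "I-587954.12.34.56", "I-58795.12.34.56", "The number 0.1" };
        foreach (string s in samples)
            Console.WriteLine($"{s}: {IsValid(s)}");
    }
}
```

Only the first sample passes: the second has five digits in the first set, and the third lacks the I- prefix entirely, even though the original generic pattern would have accepted it.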
This regex:
^.-[0-9]{6}(\.[1-8]{1,2}){3}$
will validate the following:
The first character can be any character, but is of length 1
It is followed by a dash
The dash is followed by exactly 6 numbers 0 - 9. (If this could be less than 6 characters - for example, between 3 and 6 characters - just replace {6} with {3,6}).
This is followed by 3 groups of characters. Each of these groups is preceded by a period, is of length 1 or 2, and can contain only the digits 1 - 8.
An example of a valid string is:
I-587954.12.34.56
This is also valid:
I-587954.1.3.5
But this isn't:
I-587954.12.80.356
because the second-to-last group contains a 0, and because the last group is of length 3.
Please let me know if I have misunderstood any of the rules.
^I-([0-9]{1,6})\.(.{1,2})\.(0[1-2])\.(.{1,2})$
groups delimited by . (\.) :
([0-9]{1,6}) - 1-6 digits
(.{1,2}) - one or two of any character
(0[1-2]) - 01 or 02
(.{1,2}) - one or two of any character
You can easily write and test a regex against your input data online; just search for "regex online".

How to mask first 6 and last 4 digits for a credit card number in .net

I'm very new to regex, and I'm trying to use a regular expression to turn a credit card number, which will be part of a conversation, into something like 492900******2222.
As it can come from any conversation, it might have text right next to it or might have an inconsistent format, so essentially all of the below should be formatted like the example above:
hello my number is492900001111222
number is 4929000011112222ok?
4929 0000 1111 2222
4929-0000-1111-2222
It needs to be a regular expression which extracts the capture group of which I will then be able to use a MatchEvaluator to turn all digits (excluding non digits) which are not the first 6 and last 4 into a *
I've seen many examples here on stack overflow for PHP and JS but none which helps me resolve this issue.
Any guidance will be appreciated
UPDATE
I need to expand upon an existing implementation which uses MatchEvaluator to mask each character that is not in the first 6 or last 4, and ideally I don't want to change the MatchEvaluator and just make the masking flexible based on the regular expression; see this for an example: https://dotnetfiddle.net/J2LCo0
UPDATE 2
@Matt.G's and @CAustin's answers do resolve what I asked for, but I am hitting another barrier where I can't have it be so strict. The final captured group needs to take only the digits into account and, as such, maintain the format of the input text.
So for example:
If someone types in my card number is 99 9988 8877776666 the output from the evaluation should be 99 9988 ******6666
OR
my card number is 9999-8888-7777-6666 it should output 9999-88**-****-6666.
Is this possible?
Changed the list to include items that are in my unit tests https://dotnetfiddle.net/tU6mxQ
Try Regex: (?<=\d{4}\d{2})\d{2}\d{4}(?=\d{4})|(?<=\d{4}( |-)\d{2})\d{2}\1\d{4}(?=\1\d{4})
Explanation:
2 alternative regexes
(?<=\d{4}\d{2})\d{2}\d{4}(?=\d{4}) - to handle cardnumbers without any separators (- or <space>)
(?<=\d{4}( |-)\d{2})\d{2}\1\d{4}(?=\1\d{4}) - to handle cardnumbers with separator (- or <space>)
1st Alternative (?<=\d{4}\d{2})\d{2}\d{4}(?=\d{4})
Positive Lookbehind (?<=\d{4}\d{2}) - matches text that has 6 digits immediately behind it
\d{2} - matches exactly 2 digits
\d{4} - matches exactly 4 digits
Positive Lookahead (?=\d{4}) - matches text that is followed immediately by 4 digits
2nd Alternative (?<=\d{4}( |-)\d{2})\d{2}\1\d{4}(?=\1\d{4})
Positive Lookbehind (?<=\d{4}( |-)\d{2}) - matches text that has (4 digits followed by a separator followed by 2 digits) immediately behind it
1st Capturing Group ( |-) - captures the separator, so that the next occurrence of the separator can be checked using \1
\1 - matches the same text as most recently matched by the 1st capturing group (the separator, in this case)
Positive Lookahead (?=\1\d{4}) - matches text that is followed by the separator and 4 digits
If performance is a concern, here's a pattern that only goes through 94 steps, instead of the other answer's 473, by avoiding lookaround and alternation:
\d{4}[ -]?\d{2}\K\d{2}[ -]?\d{4}
Demo: https://regex101.com/r/0XMluq/4
Edit: In C#'s regex flavor, the following pattern can be used instead, since C# allows variable length lookbehind.
(?<=\d{4}[ -]?\d{2})\d{2}[ -]?\d{4}
Demo
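Here is a minimal sketch of how the variable-length lookbehind pattern might be combined with a MatchEvaluator so that only digits are replaced and separators are kept. CardMasker and Mask are illustrative names, and the evaluator shown is an assumption, not the one from the question's fiddle.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CardMasker
{
    // Variable-length lookbehind pattern from the answer above (.NET-specific).
    private static readonly Regex Middle =
        new Regex(@"(?<=\d{4}[ -]?\d{2})\d{2}[ -]?\d{4}");

    public static string Mask(string input) =>
        // The evaluator replaces digits only, so a separator inside the match survives.
        Middle.Replace(input, m => Regex.Replace(m.Value, @"\d", "*"));

    public static void Main()
    {
        Console.WriteLine(Mask("number is 4929000011112222ok?"));
        // prints "number is 492900******2222ok?"
        Console.WriteLine(Mask("4929-0000-1111-2222"));
        // prints "4929-00**-****-2222"
    }
}
```

Note that this only covers the groupings the pattern allows (no separator, or a single space or hyphen between groups of four). Free-form spacing such as 99 9988 8877776666 would need a different pattern.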

Regular Expression/String split

I am not as familiar with RegEx as I probably should be.
However, I am looking for an expression(s) that matches a variant of values.
I have a list of values (about 30k of them total):
ABCD1234
EF56789
GH123456J
GH123456JK
LMN654987P
I need to be able to split the letters at the front, the numbers in the middle, and the letters at the end into 3 different variables. The values have an undetermined number of letters at the start, an undetermined number of digits in the middle, and an undetermined number of letters at the end.
Any help is appreciated.
You can use a regex with capturing groups like this instead of splitting:
([A-Z]+)([0-9]+)([A-Z]*)
Also if you want to match strings as case insensitive you can use the i flag.
Match information:
MATCH 1
1. [0-4] `ABCD`
2. [4-8] `1234`
3. [8-8] ``
MATCH 2
1. [9-11] `EF`
2. [11-16] `56789`
3. [16-16] ``
MATCH 3
1. [17-19] `GH`
2. [19-25] `123456`
3. [25-26] `J`
MATCH 4
1. [27-29] `GH`
2. [29-35] `123456`
3. [35-37] `JK`
MATCH 5
1. [38-41] `LMN`
2. [41-47] `654987`
3. [47-48] `P`
Additionally, if you don't want the empty content then you could use this regex:
([a-z]+)([0-9]+)([a-z]+)?
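In C#, the capturing-group approach could look roughly like this. It uses the first pattern with RegexOptions.IgnoreCase standing in for the i flag; CodeSplitter and Split are made-up names, and throwing on a non-matching value is just one possible choice.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CodeSplitter
{
    // Pattern from the answer above; IgnoreCase plays the role of the i flag.
    private static readonly Regex Parts =
        new Regex(@"([A-Z]+)([0-9]+)([A-Z]*)", RegexOptions.IgnoreCase);

    public static (string Prefix, string Number, string Suffix) Split(string value)
    {
        Match m = Parts.Match(value);
        if (!m.Success)
            throw new FormatException($"Unexpected value: {value}");

        // Group 3 is an empty string when there are no trailing letters.
        return (m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value);
    }

    public static void Main()
    {
        foreach (string value in new[] { "ABCD1234", "GH123456JK" })
        {
            var parts = Split(value);
            Console.WriteLine($"{parts.Prefix} / {parts.Number} / {parts.Suffix}");
        }
    }
}
```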
You could simply iterate over each line and split it using the entire block of digits as a delimiter.
When you include a capture group in the regex used to identify the delimiter, the delimiter is then included in the returned array.
string[] substrings = Regex.Split(originalString, @"([0-9]+)");
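A short sketch of what that call returns for two of the sample values; SplitDemo and SplitOnDigits are illustrative names. One thing to be aware of is that a value with no trailing letters still yields a third element, which is an empty string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitDemo
{
    // The capture group around the delimiter keeps the digits in the result.
    public static string[] SplitOnDigits(string value) =>
        Regex.Split(value, @"([0-9]+)");

    public static void Main()
    {
        foreach (string value in new[] { "ABCD1234", "GH123456JK" })
            Console.WriteLine(string.Join("|", SplitOnDigits(value)));
        // prints "ABCD|1234|" and then "GH|123456|JK"
    }
}
```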

Need C# Regex to match a four digit sequence, but ignore any single digits preceding

OK, I need to improve this question. Let me try this again:
I need to parse out a flight time which comes after an airport code, but may have a single digit and white space between the two.
Example data:
ORD 1100
HOU 1 1215
MAD 4 1300
I tried this:
([A-Z]{3})\s?\d?\s?(\d{4})
I end up with the airport code and a single digit.
I need a regex that will ignore everything after the airport code except the 4 digit flight time.
Hope I improved my question.
The solution might be as simple as:
\d{4}
According to your inputs, you don't need to care about preceding digits.
This is the answer I would use:
@"([A-Z]{3})\s+(?:[0-9]\s+)?([0-9]{4})"
Basically it is very similar to what you were attempting to do.
The first part is ([A-Z]{3}), which looks for 3 uppercase letters and assigns them to group 1 (Group 0 is the entire match).
The second part is \s+(?:[0-9]\s+)?, which requires at least one space, with the possibility of 1 digit in there somewhere. The noncapturing group in the middle requires that if there is a single digit there, it must be followed by at least 1 space. This prevents a mismatch for something like ABC 12345.
Next we have ([0-9]{4}), which simply matches the 4 digits you are looking for. These can be found in group 2. I use [0-9] here since \d refers to more digits than what we are used to (like Eastern Arabic numerals).
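Put together, a minimal sketch of using this pattern on the example data might look like the following. FlightParser and Parse are hypothetical names, and the input is assumed to be one string with a record per line.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FlightParser
{
    // Pattern from the answer above; the class and method names are illustrative.
    private static readonly Regex Flight =
        new Regex(@"([A-Z]{3})\s+(?:[0-9]\s+)?([0-9]{4})");

    public static List<(string Airport, string Time)> Parse(string text)
    {
        var flights = new List<(string Airport, string Time)>();
        foreach (Match m in Flight.Matches(text))
        {
            // Group 1 is the airport code, group 2 the four-digit time.
            flights.Add((m.Groups[1].Value, m.Groups[2].Value));
        }
        return flights;
    }

    public static void Main()
    {
        foreach (var f in Parse("ORD 1100\nHOU 1 1215\nMAD 4 1300"))
            Console.WriteLine($"{f.Airport} {f.Time}");
    }
}
```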
Here's a little something, using lookbehind and lookahead to be sure there are only 4 digits, with non-digits (or beginning/end) surrounding them.
"(?<=[^\d]|^)\d{4}(?=[^\d]|$)"
The two [^\d] can be replaced with [\s] to only match 4-digits with whitespace around them.
Update:
With your latest update, I merged my regex with yours (from the comment) and came up with this:
"(?<=[A-Z]{3}\s(\d\s)?)\d{4}(?=\s|$)"
There are three parts to the pattern. First is the lookbehind: (?<=PatternHere). The pattern inside this must occur/match before what we seek.
The next part is our simple main pattern: \d{4}, four digits.
The last part is the lookahead: (?=PatternHere), which is pretty much the same as lookbehind, but checks the other side, forward.
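As a rough illustration of the lookaround version: since the airport code sits inside the lookbehind, each match is just the four-digit time, with no groups to unpack. FlightTimes and Find are made-up names for this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FlightTimes
{
    // Lookbehind/lookahead pattern from the answer above. The variable-length
    // lookbehind relies on .NET's regex engine.
    private static readonly Regex Time =
        new Regex(@"(?<=[A-Z]{3}\s(\d\s)?)\d{4}(?=\s|$)");

    public static List<string> Find(string text)
    {
        var times = new List<string>();
        foreach (Match m in Time.Matches(text))
            times.Add(m.Value); // the whole match is the time itself
        return times;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Find("ORD 1100\nHOU 1 1215\nMAD 4 1300")));
        // prints "1100, 1215, 1300"
    }
}
```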
