Regex with capture groups and lookbehind within alternation, C# - c#

I am extracting information from log files that have the format below:
2023-02-08 17:29:19,803 [50] INFO Queue Index [1] out of [6]
.......
[2023-02-08 17:29:18,947] Total time [00:00:00.0093239|9ms]
My end goal is to get:
"Queue index 1",
"out of 6",
"Total time 9"
Current regex:
\b(?:(?<=(Queue Index) .?)|(?<=(out of) .?)|(?<=(Total time) .+\|))\d+(?:\.\d+)?
I expect 3 capture groups (Queue index, out of, total time), but the digit is always included in the captured group and I don't want that. I already included ?: in the alternation bracket. How do I refine this?
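One possible way to keep the label and the digits apart is to drop the lookbehinds and capture each part in its own named group. The following is a minimal sketch, not a confirmed answer from the thread: the pattern, the group names label and num, and the class name LogFieldExtractor are all assumptions made for illustration, and it assumes the values always appear inside square brackets as in the sample lines.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LogFieldExtractor
{
    // Hypothetical pattern (not from the original post):
    //   label : one of the three fixed labels
    //   \[    : the opening bracket that follows the label
    //   (?:[^\]|]*\|)? : optionally skip everything up to a '|' inside the bracket
    //   num   : the digits of interest, captured separately from the label
    private static readonly Regex FieldPattern = new Regex(
        @"(?<label>Queue Index|out of|Total time) \[(?:[^\]|]*\|)?(?<num>\d+)");

    public static List<string> Extract(string log)
    {
        var results = new List<string>();
        foreach (Match m in FieldPattern.Matches(log))
        {
            // Combine the two groups; the brackets and the "ms" suffix are never captured.
            results.Add(m.Groups["label"].Value + " " + m.Groups["num"].Value);
        }
        return results;
    }

    public static void Main()
    {
        string log =
            "2023-02-08 17:29:19,803 [50] INFO Queue Index [1] out of [6]\n" +
            "[2023-02-08 17:29:18,947] Total time [00:00:00.0093239|9ms]";

        // Prints: Queue Index 1 / out of 6 / Total time 9 (one per line)
        foreach (string field in Extract(log))
        {
            Console.WriteLine(field);
        }
    }
}
```

Because the label and the number live in different groups, either one can be read on its own through m.Groups["label"] or m.Groups["num"].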

Related

RegEx pattern to capture invoice line items containing unit prices in description

Using C#, I am attempting to extract individual invoice line items from a block of text containing ALL the line items. For each line item, I want to separate and capture the Line Item Code, Line Item Description, and Line Item Dollar Amount. The issue is that many of the line item descriptions include decimal amounts similar to dollar amounts, so the regex I am using is capturing several entire line items into one line item description. How can I alter my regex statement to include these decimal numbers in the description, while still separating prices into another match group? I am also open to other optimization suggestions
Here is the block of line items that is giving me trouble:
1244 Drayage Charge MEDU2265085
1,875.00
4083 Chassis MEDU2265085 TRIAXLE 4 DAYS
640.00
1268 Pre-Pull MEDU2265085
250.00
1248 Truck Waiting & Over Time MEDU2265085 3.5*120
420.00
1244 Drayage Charge MEDU3325790
1,875.00
4083 Chassis MEDU3325790 TRIAXLE 4 DAYS
640.00
1268 Pre-Pull MEDU3325790
250.00
1248 Truck Waiting & Over Time MEDU3325790 2.38*120
285.60
1244 Drayage Charge MSCU3870551
1,875.00
4083 Chassis MSCU3870551 TRIAXLE 4 DAYS
640.00
1268 Pre-Pull MSCU3870551
250.00
1248 Truck Waiting & Over Time MSCU3870551 3.5*120
420.00
And here is my best attempt at a regex pattern:
(?<LINE_ITEM_CODE>[0-9]{4})[\r\s\n](?<LINE_ITEM_DESCRIPTION>[A-Za-z0-9\r\s\n\-\%\&\*\.]*)[\r\n\s](?<LINE_ITEM_AMOUNT>[0-9\,]{1,7}.[0-9]{2})
If you punch these in over at regexr.com or regexstorm.net, you'll see that several of the line items are being captured as a single line item description. The alternative I had been using previously did not accommodate the 3.5, 2.38 etc. How can I target the prices while still grouping the other decimals into the description?
I'm open to alternative solutions
You can use
(?m)^(?<LINE_ITEM_CODE>\d{4})\s+(?<LINE_ITEM_DESCRIPTION>.*?)\r?\n(?<LINE_ITEM_AMOUNT>\d{1,3}(?:,\d{3})*\.\d{2})
See the regex demo.
Details:
(?m)^ - a multiline flag that makes ^ match start of a line
(?<LINE_ITEM_CODE>\d{4}) - Group "LINE_ITEM_CODE": four digits
\s+ - one or more whitespaces (including newlines)
(?<LINE_ITEM_DESCRIPTION>.*?) - Group "LINE_ITEM_DESCRIPTION": any zero or more chars other than newline chars as few as possible
\r?\n - CRLF or LF
(?<LINE_ITEM_AMOUNT>\d{1,3}(?:,\d{3})*\.\d{2}) - Group "LINE_ITEM_AMOUNT": one to three digits and then zero or more repetitions of a comma and three digits and then a dot and two digits.
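As a rough illustration of how this pattern might be used from C#, the sketch below runs it over two of the line items from the question. The class name InvoiceParser and the shortened sample text are assumptions for the example; only the pattern itself comes from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InvoiceParser
{
    // Pattern taken from the answer above; (?m) makes ^ match at each line start.
    private static readonly Regex LineItem = new Regex(
        @"(?m)^(?<LINE_ITEM_CODE>\d{4})\s+(?<LINE_ITEM_DESCRIPTION>.*?)\r?\n(?<LINE_ITEM_AMOUNT>\d{1,3}(?:,\d{3})*\.\d{2})");

    public static MatchCollection Parse(string text)
    {
        return LineItem.Matches(text);
    }

    public static void Main()
    {
        // Shortened sample: one plain item and one whose description holds a decimal.
        string text =
            "1244 Drayage Charge MEDU2265085\n" +
            "1,875.00\n" +
            "1248 Truck Waiting & Over Time MEDU2265085 3.5*120\n" +
            "420.00\n";

        foreach (Match m in Parse(text))
        {
            Console.WriteLine(
                m.Groups["LINE_ITEM_CODE"].Value + " | " +
                m.Groups["LINE_ITEM_DESCRIPTION"].Value + " | " +
                m.Groups["LINE_ITEM_AMOUNT"].Value);
        }
    }
}
```

The description group stops at the line break, so "3.5*120" stays in the description while the amount on the following line goes to its own group.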

How to mask first 6 and last 4 digits for a credit card number in .net

I'm very new to regex, and I'm trying to use a regular expression to turn a credit card number, which will be part of a conversation, into something like 492900******2222
As it can come from any conversation it might contain string next to it or might have an inconsistent format, so essentially all of the below should be formatted to the example above:
hello my number is492900001111222
number is 4929000011112222ok?
4929 0000 1111 2222
4929-0000-1111-2222
It needs to be a regular expression which extracts the capture group of which I will then be able to use a MatchEvaluator to turn all digits (excluding non digits) which are not the first 6 and last 4 into a *
I've seen many examples here on stack overflow for PHP and JS but none which helps me resolve this issue.
Any guidance will be appreciated
UPDATE
I need to expand upon an existing implementation which uses MatchEvaluator to mask each character that is not among the first 6 or last 4. Ideally I don't want to change the MatchEvaluator and would rather make the masking flexible based on the regular expression; see this for an example: https://dotnetfiddle.net/J2LCo0
UPDATE 2
@Matt.G's and @CAustin's answers do resolve what I asked for, but I am hitting another barrier where I can't have it be so strict. The final captured group needs to take only the digits into account and so maintain the format of the input text.
So for example:
If someone types in my card number is 99 9988 8877776666 the output from the evaluation should be 99 9988 ******666666
OR
my card number is 9999-8888-7777-6666 it should output 9999-88**-****-6666.
Is this possible?
Changed the list to include items that are in my unit tests https://dotnetfiddle.net/tU6mxQ
Try Regex: (?<=\d{4}\d{2})\d{2}\d{4}(?=\d{4})|(?<=\d{4}( |-)\d{2})\d{2}\1\d{4}(?=\1\d{4})
Explanation:
2 alternative regexes
(?<=\d{4}\d{2})\d{2}\d{4}(?=\d{4}) - to handle cardnumbers without any separators (- or <space>)
(?<=\d{4}( |-)\d{2})\d{2}\1\d{4}(?=\1\d{4}) - to handle cardnumbers with separator (- or <space>)
1st Alternative (?<=\d{4}\d{2})\d{2}\d{4}(?=\d{4})
Positive Lookbehind (?<=\d{4}\d{2}) - matches text that has 6 digits immediately behind it
\d{2} matches a digit (equal to [0-9])
{2} Quantifier — Matches exactly 2 times
\d{4} matches a digit (equal to [0-9])
{4} Quantifier — Matches exactly 4 times
Positive Lookahead (?=\d{4}) - matches text that is followed immediately by 4 digits
Assert that the Regex below matches
\d{4} matches a digit (equal to [0-9])
{4} Quantifier — Matches exactly 4 times
2nd Alternative (?<=\d{4}( |-)\d{2})\d{2}\1\d{4}(?=\1\d{4})
Positive Lookbehind (?<=\d{4}( |-)\d{2}) - matches text that has (4 digits followed by a separator followed by 2 digits) immediately behind it
1st Capturing Group ( |-) - get the separator as a capturing group, this is to check the next occurrence of the separator using \1
\1 matches the same text as most recently matched by the 1st capturing group (separator, in this case)
Positive Lookahead (?=\1\d{4}) - matches text that is followed by separator and 4 digits
If performance is a concern, here's a pattern that only goes through 94 steps, instead of the other answer's 473, by avoiding lookaround and alternation:
\d{4}[ -]?\d{2}\K\d{2}[ -]?\d{4}
Demo: https://regex101.com/r/0XMluq/4
Edit: In C#'s regex flavor, the following pattern can be used instead, since C# allows variable length lookbehind.
(?<=\d{4}[ -]?\d{2})\d{2}[ -]?\d{4}
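A minimal sketch of how the variable-length lookbehind pattern could be combined with a MatchEvaluator in C# follows. The class name CardMasker and the inner digit replacement are assumptions for illustration; the pattern is the one from the answer, and the sketch assumes 16-digit numbers grouped in fours, as in the examples.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CardMasker
{
    // Pattern from the answer: the lookbehind requires the first six digits
    // (with an optional separator after the fourth) to sit before the match.
    private static readonly Regex Middle = new Regex(
        @"(?<=\d{4}[ -]?\d{2})\d{2}[ -]?\d{4}");

    public static string Mask(string input)
    {
        // The evaluator replaces only the digits of each match,
        // so a space or hyphen inside the match keeps its place.
        return Middle.Replace(input, m => Regex.Replace(m.Value, @"\d", "*"));
    }

    public static void Main()
    {
        Console.WriteLine(Mask("4929000011112222"));     // 492900******2222
        Console.WriteLine(Mask("4929-0000-1111-2222"));  // 4929-00**-****-2222
        Console.WriteLine(Mask("4929 0000 1111 2222"));  // 4929 00** **** 2222
    }
}
```

Since the match covers only the six middle digits and at most one separator, the first six and last four digits are never passed to the evaluator.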

Regular Expression/String split

I am not as familiar with RegEx as I probably should be.
However, I am looking for one or more expressions that match a variety of values.
I have a list of values (about 30k of them total):
ABCD1234
EF56789
GH123456J
GH123456JK
LMN654987P
I need to be able to split the letters at the front, the numbers in the middle and the letters at the end into 3 different variables. The values have an undetermined number of letters at the start, an undetermined number of digits in the middle and an undetermined number of letters at the end.
Any help is appreciated.
You can use a regex with capturing groups like this instead of splitting:
([A-Z]+)([0-9]+)([A-Z]*)
Also if you want to match strings as case insensitive you can use the i flag.
Match information:
MATCH 1
1. [0-4] `ABCD`
2. [4-8] `1234`
3. [8-8] ``
MATCH 2
1. [9-11] `EF`
2. [11-16] `56789`
3. [16-16] ``
MATCH 3
1. [17-19] `GH`
2. [19-25] `123456`
3. [25-26] `J`
MATCH 4
1. [27-29] `GH`
2. [29-35] `123456`
3. [35-37] `JK`
MATCH 5
1. [38-41] `LMN`
2. [41-47] `654987`
3. [47-48] `P`
Additionally, if you don't want the empty content then you could use this regex:
([a-z]+)([0-9]+)([a-z]+)?
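In C#, the capturing-group approach might look like the sketch below. The class name CodeSplitter and the ^ and $ anchors are assumptions added for the example (each value is matched on its own rather than inside a larger text); RegexOptions.IgnoreCase plays the role of the i flag.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CodeSplitter
{
    // ^ and $ are additions for this sketch: they make the whole value
    // conform to letters + digits + optional letters.
    private static readonly Regex Parts = new Regex(
        @"^([A-Z]+)([0-9]+)([A-Z]*)$", RegexOptions.IgnoreCase);

    // Returns { leading letters, digits, trailing letters },
    // or an empty array when the value does not fit the shape.
    public static string[] Split(string value)
    {
        Match m = Parts.Match(value);
        if (!m.Success)
        {
            return new string[0];
        }
        return new[] { m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value };
    }

    public static void Main()
    {
        foreach (string value in new[] { "ABCD1234", "EF56789", "GH123456JK" })
        {
            string[] parts = Split(value);
            Console.WriteLine(parts[0] + " / " + parts[1] + " / " + parts[2]);
        }
    }
}
```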
You could simply iterate over each line and split them using entire block of numbers as a delimiter.
When you include a capture group in the regex used to identify the delimiter, the delimiter is then included in the returned array.
string[] substrings = Regex.Split(originalString, @"([0-9]+)");
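A small runnable sketch of the Regex.Split approach is shown here; the class name SplitDemo is an assumption for the example. Note that when the digits sit at the very end of the value, Regex.Split returns an empty string as the last element.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitDemo
{
    // The capture group around [0-9]+ makes Regex.Split keep the
    // delimiter (the block of digits) in the returned array.
    public static string[] SplitOnDigits(string value)
    {
        return Regex.Split(value, @"([0-9]+)");
    }

    public static void Main()
    {
        // "GH123456JK" -> "GH", "123456", "JK"
        Console.WriteLine(string.Join(" | ", SplitOnDigits("GH123456JK")));

        // "ABCD1234" -> "ABCD", "1234", "" (trailing empty string)
        Console.WriteLine(string.Join(" | ", SplitOnDigits("ABCD1234")));
    }
}
```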

Splitting string into groups with Regular Expression

I need a Regular expression that allows me to split the following string in c#:
$1.89 BROWN RICE ‐ 16 03/01 ‐ 03/07 1.29
into something like this:
Group 1: BROWN RICE - 16
Group 2: 03/01 ‐ 03/07
Group 3: 1.29
Is it possible to achieve this with Regex?
In your case I think a regex will be better than splitting.
If it's original price (Product - Qty) (date range) (sale price), you can try something like
\$?\d+\.\d{2} ([A-Za-z ]+- *\d+) +(\d{2}/\d{2} *- *\d{2}/\d{2}) +\$?(\d+\.\d{2})
Title & Quantity are in captured group 1, date range in group 2, new price in group 3.
Explanation:
\$?\d+\.\d{2}: a price, optional dollar sign, exactly two decimal places (for the cents). If you want to allow '$1' (ie no decimal places) then modify accordingly.
([A-Za-z ]+- *\d+) Object name and quantity (separated by a hyphen). You may wish to modify this regex depending on the expected names you will get in (perhaps they aren't just consisting of letters and spaces).
(\d{2}/\d{2} *- *\d{2}/\d{2}) date range. I have no idea if yours is month/day or day/month, but depending on that you can make your regex a lot more exclusive if you wish (for example, a numeric date is ([012]\d|3[01]) and a month only goes from 1 to 12).
\$?(\d+\.\d{2}) the saleprice.
Have you tried having a go with something like regexpal? Makes it easy to test out how to filter the data you're interested in. There's a bunch of hints in the top right which basically describes how to write regular expressions too...
First we want to match the price, but we don't care about it, so the ?: makes that group non-capturing:
(?:\$\d+\.\d+)
Since we know what the third part should look like, our first section of interest can gobble up anything in the middle:
(.*)
Next we want to match that date range:
(\d{2}/\d{2} ‐ \d{2}/\d{2})
And finally we want a floating point number:
(\d+\.\d+)
So in conclusion, something like this should work:
(?:\$\d+\.\d+) (.*) (\d{2}/\d{2} ‐ \d{2}/\d{2}) (\d+\.\d+)
You'll need to escape the backslashes to include that in c#
(\$\d\.\d{2}) (.*?) (\d{2}/\d{2} - \d{2}/\d{2}) (.*)
This works on your example. It may need to be improved if you have any more data variations
(\$\d\.\d{2}) - Match the price $0.00
- If prices can be more than $9 then you'd need to
make this match one or more (\$\d+\.\d{2})
(.*?) - Lazy match everything till the next group
(\d{2}/\d{2} - \d{2}/\d{2}) - Match the date range
(.*) - Match what ever is left
You may also wish to put start and end line constraints if you're reading a bunch of these from a text file.
/^\$\d*\.\d{2,}\s([^-]+\s[-]\s\d+)\s(\d{2}\/\d{2}\s[-]\s\d{2}\/\d{2})\s(\d*\.\d{2,})$/
Group 1 : Brown Rice - 16
Group 2 : 03/01 ‐ 03/07
Group 3 : 1.29 (will also match 0.29 and .29)
Try
(\$\d+\.\d+)\s(.*?)\s(\d{2}/\d{2}\s-\s\d{2}/\d{2})\s(\d+\.\d+)
(\$\d+\.\d+) matches the price in dollars
(.*?) matches the product name
(\d{2}/\d{2}\s-\s\d{2}/\d{2}) matches the date range
(\d+\.\d+) matches the second price
I noticed that the minus sign (-) in your example uses a different character code than the standard minus sign. Therefore, my regex did not work until I replaced your "-" with normal ones.
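The hyphen issue can be handled by normalizing the input before matching. The sketch below assumes the unusual character is U+2010 (HYPHEN), which is a guess based on the sample text; the class name PromoParser is also hypothetical. The pattern is the last one given above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PromoParser
{
    // Pattern from the answer above; it expects the ASCII hyphen-minus.
    private static readonly Regex Line = new Regex(
        @"(\$\d+\.\d+)\s(.*?)\s(\d{2}/\d{2}\s-\s\d{2}/\d{2})\s(\d+\.\d+)");

    public static Match Parse(string input)
    {
        // Assumption: the input uses U+2010; replace it with '-' first.
        string normalized = input.Replace('\u2010', '-');
        return Line.Match(normalized);
    }

    public static void Main()
    {
        string input = "$1.89 BROWN RICE \u2010 16 03/01 \u2010 03/07 1.29";
        Match m = Parse(input);
        if (m.Success)
        {
            Console.WriteLine(m.Groups[2].Value); // BROWN RICE - 16
            Console.WriteLine(m.Groups[3].Value); // 03/01 - 03/07
            Console.WriteLine(m.Groups[4].Value); // 1.29
        }
    }
}
```

If other dash variants can appear in the data, the same Replace step could be extended to cover them.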

Groups in a C# regular expression

I'm using the following tester to try and figure out this regex:
http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx
My input:
123stringA 456 stringB
My pattern:
([0-9]{3})(.*?)
The pattern will eventually be a date but for this question's sake, I'll keep it simple and use my simplified input.
The way I understand this pattern, it's "give me 3 numbers [0-9]{3}, followed by any number of characters of any kind .*, until it reaches the next match ?".
What I want/expect out of this test is 2 matches with 2 groups each:
Match 1
Group 1 - 123
Group 2 - stringA
Match2
Group 1 - 456
Group 2 - stringB
For some reason, the tester at the link I provided sees that there is a second group, but it's coming up blank. I have done this with PHP before and it seemed to work as I described, but in C# I'm seeing different results. Any help you can provide would be appreciated.
I should also note that this could span multiple lines...
EDIT *
Here's the actual input:
2011-08-09 09:25:57,069 [9] Orchard.Environment.Extensions.ExtensionManager - Error loading extension 2011-08-09 09:25:57,493 [8] Orchard.Environment.Extensions.ExtensionManager
For match 1 I'm wanting to get:
2011-08-09 09:25:57 and
,069 [9] Orchard.Environment.Extensions.ExtensionManager - Error loading extension
and for match 2:
2011-08-09 09:25:57 and
,493 [8] Orchard.Environment.Extensions.ExtensionManager
I'm trying to find a good way to parse an error log file that's in one giant text file and maintain the date the error happened and the details that went along with it
The first group matches 3 digits and the second group matches the remainder of the string because there's nothing in the pattern to prevent the .*? from not matching the remainder of the string.
CORRECTION: The second group matches an empty string because there's nothing in the pattern to prevent the .*? from matching an empty string.
.* means match any character zero or more times. The ? means match as few times as possible, so it chooses zero characters as the minimum.
Try this pattern, ([0-9]{3})([a-zA-Z]*)
According to your comment, this is what you want to match
2011-08-09 09:25:57,069 [9]
Orchard.Environment.Extensions.ExtensionManager - Error loading
extension 2011-08-09 09:25:57,493 [8]
Orchard.Environment.Extensions.ExtensionManager - Error loading
extension
This expression will match the Date in the first capturing group and the rest till the next date OR till the end of the string in the second capturing group.
(\d{4}(?:-\d{2}){2})(.*?)(?=(?:\d{4}(?:-\d{2}){2}|$))
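A sketch of this expression applied from C# follows. The class name LogEntrySplitter is hypothetical, and RegexOptions.Singleline is an assumption added so that an entry may run across line breaks. With this pattern, group 1 holds only the date; the time and the rest of the entry land in group 2.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LogEntrySplitter
{
    // Pattern from the answer above. Singleline (an addition for this sketch)
    // lets '.' match newlines so that multi-line entries stay together.
    private static readonly Regex Entry = new Regex(
        @"(\d{4}(?:-\d{2}){2})(.*?)(?=(?:\d{4}(?:-\d{2}){2}|$))",
        RegexOptions.Singleline);

    public static MatchCollection Split(string log)
    {
        return Entry.Matches(log);
    }

    public static void Main()
    {
        string log =
            "2011-08-09 09:25:57,069 [9] Orchard.Environment.Extensions.ExtensionManager - Error loading extension " +
            "2011-08-09 09:25:57,493 [8] Orchard.Environment.Extensions.ExtensionManager";

        foreach (Match m in Split(log))
        {
            Console.WriteLine("Date:    " + m.Groups[1].Value);
            Console.WriteLine("Details: " + m.Groups[2].Value.Trim());
        }
    }
}
```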
See it here on Regexr
Not sure why the tool gives you that, but you can switch to this alternative pattern that works in .Net
([0-9]{3})([^0-9]*)
http://regexhero.net/tester/?id=155b8e2b-b851-46b9-8a84-b82f8d6963a1
Explanation:
In your previous pattern, the nongreedy version was matching 0 characters.
In the new one, [^0-9] says match any character other than the range 0-9 (note the negation ^ specifier).
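The difference between the two patterns can be checked with a short sketch like the one below, using the simplified input from the question. The class name LazyGroupDemo and its helper method are assumptions for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyGroupDemo
{
    // Returns the text captured by group 2 for every match of the pattern.
    public static string[] SecondGroups(string pattern, string input)
    {
        MatchCollection matches = Regex.Matches(input, pattern);
        var result = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            result[i] = matches[i].Groups[2].Value;
        }
        return result;
    }

    public static void Main()
    {
        string input = "123stringA 456 stringB";

        // Lazy .*? with nothing after it is satisfied by zero characters,
        // so group 2 is empty in both matches.
        foreach (string g in SecondGroups(@"([0-9]{3})(.*?)", input))
        {
            Console.WriteLine("[" + g + "]");
        }

        // [^0-9]* is greedy and runs until the next digit (or the end),
        // so group 2 is "stringA " and then " stringB".
        foreach (string g in SecondGroups(@"([0-9]{3})([^0-9]*)", input))
        {
            Console.WriteLine("[" + g + "]");
        }
    }
}
```

The captured values keep the surrounding spaces; calling Trim() on them removes those if they are not wanted.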
Update: Given the actual input string (in the comments), the pattern changes to the following (this is a guess at what the OP wants to do):
,([0-9]{3})([^\n]*)
http://regexhero.net/tester/?id=155b8e2b-b851-46b9-8a84-b82f8d6963a1
