Regular Expression/String split - c#

I am not as familiar with RegEx as I probably should be.
However, I am looking for an expression (or expressions) that matches a variety of values.
I have a list of values (about 30k of them total):
ABCD1234
EF56789
GH123456J
GH123456JK
LMN654987P
I need to split each value into 3 different variables: the letters at the front, the numbers in the middle, and the letters at the end. Each value has an undetermined number of letters at the start, an undetermined number of digits in the middle, and an undetermined number of letters at the end.
Any help is appreciated.

You can use a regex with capturing groups like this instead of splitting:
([A-Z]+)([0-9]+)([A-Z]*)
If you also want to match case-insensitively, you can use the i flag (RegexOptions.IgnoreCase in C#).
Match information:
MATCH 1
1. [0-4] `ABCD`
2. [4-8] `1234`
3. [8-8] ``
MATCH 2
1. [9-11] `EF`
2. [11-16] `56789`
3. [16-16] ``
MATCH 3
1. [17-19] `GH`
2. [19-25] `123456`
3. [25-26] `J`
MATCH 4
1. [27-29] `GH`
2. [29-35] `123456`
3. [35-37] `JK`
MATCH 5
1. [38-41] `LMN`
2. [41-47] `654987`
3. [47-48] `P`
Additionally, if you don't want the empty third group, you could make it optional with this regex (with the i flag set, since it uses lowercase character classes):
([a-z]+)([0-9]+)([a-z]+)?
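As a minimal sketch of how the capturing groups could be read in C#, the snippet below applies the first pattern to the sample values. The ^ and $ anchors and the SplitValue helper are assumptions added for illustration, not part of the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

// Anchored (an assumption) so the whole value must fit the letters-digits-letters shape.
var pattern = new Regex(@"^([A-Z]+)([0-9]+)([A-Z]*)$", RegexOptions.IgnoreCase);

// Hypothetical helper: returns { prefix, number, suffix }, or an empty array if the value does not fit.
string[] SplitValue(string value)
{
    Match m = pattern.Match(value);
    if (!m.Success) return Array.Empty<string>();
    return new[] { m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value };
}

foreach (string value in new[] { "ABCD1234", "EF56789", "GH123456J", "GH123456JK", "LMN654987P" })
{
    // The third part is an empty string when there are no trailing letters.
    Console.WriteLine(string.Join(" | ", SplitValue(value)));
}
```

For 30k values, constructing the Regex once and reusing it, as here, avoids re-parsing the pattern on every call.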

You could simply iterate over each line and split it using the entire block of digits as the delimiter.
When the regex that identifies the delimiter contains a capture group, the captured delimiter is included in the returned array.
string[] substrings = Regex.Split(originalString, @"([0-9]+)");
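A short sketch of what that split returns for two of the sample values. The variable names are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

// The capture group keeps the digit block in the result array.
string[] withSuffix = Regex.Split("GH123456JK", @"([0-9]+)");
Console.WriteLine(string.Join(" | ", withSuffix)); // GH | 123456 | JK

// When the digits end the string, the last element is an empty string.
string[] noSuffix = Regex.Split("ABCD1234", @"([0-9]+)");
Console.WriteLine(noSuffix.Length);                // 3
```

Because the trailing element is empty rather than missing, index 2 can be read safely for every value that contains exactly one block of digits.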

Related

Matching a sequence of characters split by spaces after a prefix

I have the following strings:
-prefix <#141222969505480701> where the second part e.g. <#141222969505480701> can be repeated unlimited times (only the numbers change).
-prefix 141222969505480701 which should behave the same as above.
-prefix 141222969505480701 <#141222969505480702> which would still be able to repeat itself forever.
The last one should have groups containing 141222969505480701 and 141222969505480702.
So a few bits of information:
The digit chains are always 18 in total so I use \d{18} in my regex
I would like to have the numbers in groups for me to use them afterwards.
What I tried
First off, I tried to match the first of my example strings.
-prefix(\s<#\d{18}>)\1* which would match the entire string, but I would like to have the digits themselves in their own group. Also, this method only matches repeated identical parts, e.g. <#141222969505480701> <#141222969505480701> <#141222969505480701> would match, but any other number in between wouldn't match.
What would sound logical in my head
-prefix (\d{18})+ but it would only match the first one of the 'digit parts'.
While I was testing it on regex101 it told me the following:
A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data.
I tried to adjust the regex to the following -prefix ((\d{18})+), but with the same result.
With the help of @madreflection in the comments I was able to come up with this solution:
-prefix([\s]*(<#|)(?<digits>[0-9]{18})>?)+
This is exactly what I needed, and it even ignores spaces in between. Using match.Groups["digits"].Captures also made the whole thing a lot easier.
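A minimal sketch of reading every repetition through the Captures collection, using the solution pattern above on the third example string. The variable names are illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

var regex = new Regex(@"-prefix([\s]*(<#|)(?<digits>[0-9]{18})>?)+");
Match match = regex.Match("-prefix 141222969505480701 <#141222969505480702>");

// Group.Value only holds the last iteration; Captures holds all of them.
var ids = new List<string>();
foreach (Capture capture in match.Groups["digits"].Captures)
{
    ids.Add(capture.Value);
}

Console.WriteLine(string.Join(", ", ids)); // 141222969505480701, 141222969505480702
```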
You could use an alternation to list the 3 different allowed formats. .NET allows a group name to be reused within one pattern.
-prefix\s*(?:(?<digits>[0-9]{18})\s*<#(?<digits>[0-9]{18})>|(?<digits>[0-9]{18})|<#(?<digits>[0-9]{18})>)
Pattern parts
-prefix\s* Match -prefix literally, followed by 0+ whitespace characters
(?: Non-capturing group
(?<digits>[0-9]{18})\s*<#(?<digits>[0-9]{18})> 2 named capturing groups which will match the digits
| Or
(?<digits>[0-9]{18}) Named capturing group, match digits only
| Or
<#(?<digits>[0-9]{18})> Named capturing group, match digits between brackets only
) Close the non-capturing group
You could also use 2 named capturing groups, 1 for each format. For example:
-prefix\s*(?:(?<digits>[0-9]{18})\s*<#(?<digitsBrackets>[0-9]{18})>|(?<digits>[0-9]{18})|<#(?<digitsBrackets>[0-9]{18})>)
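A hedged sketch of the two-group variant in C#. The closing > on the last alternative is written out here as an assumption, and the Describe helper is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

var regex = new Regex(
    @"-prefix\s*(?:(?<digits>[0-9]{18})\s*<#(?<digitsBrackets>[0-9]{18})>" +
    @"|(?<digits>[0-9]{18})" +
    @"|<#(?<digitsBrackets>[0-9]{18})>)"); // trailing '>' is an assumption

// Hypothetical helper: formats both groups; a group that did not take part is empty.
string Describe(string input)
{
    Match m = regex.Match(input);
    return $"digits='{m.Groups["digits"].Value}' brackets='{m.Groups["digitsBrackets"].Value}'";
}

Console.WriteLine(Describe("-prefix 141222969505480701 <#141222969505480702>"));
Console.WriteLine(Describe("-prefix 141222969505480701"));
Console.WriteLine(Describe("-prefix <#141222969505480701>"));
```

Group.Success can be checked instead of comparing against an empty string when it matters whether a group took part in the match.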

Repeating pattern matching with Regex

I am trying to validate an input with a regular expression. Up until now all my tests fail and as my experience with regex is limited I thought someone might be able to help me out.
Pattern: digit (possibly "," digit) (possibly ;)
A string may neither begin nor end with a ;.
Digits are allowed to stand alone or with
My regex (not working): ((\d)(,\d)?)(;?) The problem is that it does not seem to check until the end of the string. Also, the optional parts are giving me headaches.
Update: ^[0-9]+(,[0-9])?(;[0-9]+(,[0-9])?)+$ seems to work better, but it does not match a single digit.
OK:
2,3;4,4;3,2
2,3
2
2,3;3;4,3
NOK:
2,3,,,,
2,3asfafafa
;2,3
2,3;;3,4
2,3;3,4;
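Your ^[0-9]+(,[0-9])?(;[0-9]+(,[0-9])?)+$ regex matches 1 or more digits, then an optional sequence of , and 1 digit, followed by one or more similar sequences, each introduced by a ;. Because that last group is required at least once, an input without any ; (such as 2 or 2,3) cannot match.
You need to match zero or more of those ;-separated sequences, so the final quantifier becomes * instead of +:
^\d+(?:,\d+)?(?:;\d+(?:,\d+)?)*$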
Now, tweaking part:
If only single-digit numbers should be matched, use ^\d(?:,\d)?(?:;\d(?:,\d)?)*$
If the comma-separated number pairs can have the second element empty, add ? after each ,\d (if single digit numbers are to be matched) or * (if the numbers can have more than one digit): ^\d(?:,\d?)?(?:;\d(?:,\d?)?)*$ or ^\d+(?:,\d*)?(?:;\d+(?:,\d*)?)*$.
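A small sketch that runs the multi-digit pattern against the OK and NOK samples from the question. The array and variable names are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

var validator = new Regex(@"^\d+(?:,\d+)?(?:;\d+(?:,\d+)?)*$");

string[] ok = { "2,3;4,4;3,2", "2,3", "2", "2,3;3;4,3" };
string[] nok = { "2,3,,,,", "2,3asfafafa", ";2,3", "2,3;;3,4", "2,3;3,4;" };

// Every OK sample should print True and every NOK sample False.
foreach (string s in ok) Console.WriteLine($"{s} -> {validator.IsMatch(s)}");
foreach (string s in nok) Console.WriteLine($"{s} -> {validator.IsMatch(s)}");
```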

C# Regular Expression: Search the first 3 letters of each name

Does anyone know how I can write a regex (C#) that searches on the first 3 letters of each part of a full name,
without using (.*)?
I used (.*), but it runs through the text far beyond the requested name, or,
if it finds the first condition and then finds the second condition 100 words later, it returns text that is not what I am looking for, so I have to limit the number of words.
Example: \s*(?:\s+\S+){0,2}\s*
I would like to ignore name parts with fewer than 3 characters if they exist in the name.
Search for any name whose parts start with these 3-character prefixes:
'Mar Jac Rey' (regex that performs search)
Should match:
Marck Jacobs L. S. Reynolds
Marcus Jacobine Reys
Maroon Jacqueline by Reyils
Can anyone help me?
The zero or more quantifier (*) is 'greedy' by default: it will consume as many characters as possible while still finding the remainder of the pattern. This is why Mar.*Jac will match the first Mar in the input and the last Jac and everything in between.
One potential solution is just to make your pattern 'non-greedy' (*?). This will make it consume as few characters as possible in order to match the remainder of the pattern.
Mar.*?Jac.*?Rey
However, this is not a great solution because it would still match the various name parts regardless of what other text appears in between—e.g. Marcus Jacobine Should Not Match Reys would be a valid match.
To allow only whitespace or at most 2 consecutive non-whitespace characters to appear between each name part, you'd have to get more fancy:
\bMar\w*(\s+\S{0,2})*\s+Jac\w*(\s+\S{0,2})*\s+Rey\w*
The pattern (\s+\S{0,2})*\s+ will match any number of non-whitespace tokens of at most two characters each, every one preceded by whitespace, followed by the whitespace before the next name part. The \w* after each name part ensures that the entire name is included in that part of the match (you might want to use \S* instead here, but that's not entirely clear from your question). And I threw in a word boundary (\b) at the beginning to ensure that the match does not start in the middle of a 'word' (e.g. OMar would not match).
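To make the difference concrete, here is a minimal sketch comparing the non-greedy pattern with the stricter one on the sample names plus the counterexample mentioned above. The variable names are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

var lazy = new Regex(@"Mar.*?Jac.*?Rey");
var strict = new Regex(@"\bMar\w*(\s+\S{0,2})*\s+Jac\w*(\s+\S{0,2})*\s+Rey\w*");

string[] names =
{
    "Marck Jacobs L. S. Reynolds",
    "Marcus Jacobine Reys",
    "Maroon Jacqueline by Reyils",
    "Marcus Jacobine Should Not Match Reys", // counterexample: longer words in between
};

// Both patterns accept the first three names; only the lazy one accepts the last.
foreach (string name in names)
{
    Console.WriteLine($"{name}: lazy={lazy.IsMatch(name)}, strict={strict.IsMatch(name)}");
}
```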
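I think what you want is this regular expression, which checks whether the text starts with one of the three prefixes (combine it with RegexOptions.IgnoreCase to make it case insensitive). Note that the alternatives go in a group; inside square brackets they would be treated as a set of single characters.
@"^(?:Mar|Jac|Rey)"
Less specific:
@"^\w{3}"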
If you want to capture the first three letters of every word that is longer than three characters, you could use something like:
((?<name>[\w]{3})\w+)+
And enable ExplicitCapture when initializing your Regex.
It will return a series of matches, each with a group named "name" that holds one result.
Code sample:
Regex regex = new Regex(@"((?<name>[\w]{3})\w+)+", RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase);
var match = regex.Matches("Marck Jacobs L. S. Reynolds");
If you also want to capture words of exactly 3 characters, you can replace the last "\w" with a space. In that case, remember to handle the last word of the phrase.
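A short sketch of how the named group could be read from each match in the code sample. The prefixes list is an illustrative name.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

var regex = new Regex(@"((?<name>[\w]{3})\w+)+",
    RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase);

// One match per word of four or more characters; "L" and "S" are skipped.
var prefixes = new List<string>();
foreach (Match m in regex.Matches("Marck Jacobs L. S. Reynolds"))
{
    prefixes.Add(m.Groups["name"].Value);
}

Console.WriteLine(string.Join(", ", prefixes)); // Mar, Jac, Rey
```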

Regex Substring or Left Equivalent

Greetings beloved comrades.
I cannot figure out how to accomplish the following via a regex.
I need to take this format number 201101234 and transform it to 11-0123401, where digits 3 and 4 become the digits to the left of the dash, and the remaining five digits are inserted to the right of the dash, followed by a hardcoded 01.
I've tried http://gskinner.com/RegExr, but the syntax just defeats me.
This answer, Equivalent of Substring as a RegularExpression, sounds promising, but I can't get it to parse correctly.
I can create a SQL function to accomplish this, but I'd rather not hammer my server in order to reformat some strings.
Thanks in advance.
You can try this:
var input = "201101234";
var output = Regex.Replace(input, @"^\d{2}(\d{2})(\d{5})$", "${1}-${2}01");
Console.WriteLine(output); // 11-0123401
This will match:
two digits, followed by
two digits captured as group 1, followed by
five digits captured as group 2
And return a string which replaces that matched text with
group 1, followed by
a literal hyphen, followed by
group 2, followed by
a literal 01.
The start and end anchors ( ^ / $ ) ensure that if the input string does not exactly match this pattern, it will simply return the original string.
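A minimal sketch of that fallback behaviour. The Reformat helper name and the second input are assumptions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper wrapping the replacement shown above.
string Reformat(string input) =>
    Regex.Replace(input, @"^\d{2}(\d{2})(\d{5})$", "${1}-${2}01");

Console.WriteLine(Reformat("201101234"));  // 11-0123401
Console.WriteLine(Reformat("2011-01234")); // 2011-01234 (no match, returned unchanged)
```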
If you can use custom C# scripts, you may want to use Substring instead:
string newStr = string.Format("{0}-{1}01", old.Substring(2,2), old.Substring(4));
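Unlike the anchored regex, Substring throws an ArgumentOutOfRangeException when the input is too short. The sketch below adds a length guard; the guard and the ReformatWithSubstring name are assumptions, not part of the answer.

```csharp
using System;

// Hypothetical helper: same output as the one-liner above, but returns short input unchanged.
string ReformatWithSubstring(string old)
{
    if (old.Length < 4) return old; // Substring(2, 2) needs at least 4 characters
    return string.Format("{0}-{1}01", old.Substring(2, 2), old.Substring(4));
}

Console.WriteLine(ReformatWithSubstring("201101234")); // 11-0123401
Console.WriteLine(ReformatWithSubstring("20"));        // 20
```

This version does not check that the characters are digits, so it suits input that is already known to be well formed.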
I don't think you really need a regex here. Substring would be better. But still if you want regex only, you can use this:
string newString = Regex.Replace(input, @"^\d{2}(\d{2})(\d+)$", "$1-${2}01");
Explanation:
^\d{2} // Match first 2 digits. Will be ignored
(\d{2}) // Match next 2 digits. Capture it in group 1
(\d+)$ // Match rest of the digits. Capture it in group 2
Now, the required digits, are in group 1 and 2, which you use in the replacement string.
Do you even SQL? Pull some levers and stuff.

Need C# Regex to match a four digit sequence, but ignore any single digits preceding

OK, I need to improve this question. Let me try this again:
I need to parse out a flight time which comes after an airport code, but may have a single digit and white space between the two.
Example data:
ORD 1100
HOU 1 1215
MAD 4 1300
I tried this:
([A-Z]{3})\s?\d?\s?(\d{4})
I end up with the airport code and a single digit.
I need a regex that will ignore everything after the airport code except the 4 digit flight time.
Hope I improved my question.
The solution might be as simple as:
\d{4}
According to your inputs, you don't need to care about preceding digits.
This is the answer I would use:
@"([A-Z]{3})\s+(?:[0-9]\s+)?([0-9]{4})"
Basically it is very similar to what you were attempting to do.
The first part is ([A-Z]{3}), which looks for 3 uppercase letters and assigns them to group 1 (Group 0 is the entire string).
The second part is \s+(?:[0-9]\s+)?, which requires at least one space, with the possibility of 1 digit in there somewhere. The noncapturing group in the middle requires that if there is a single digit there, it must be followed by at least 1 space. This prevents a mismatch for something like ABC 12345.
Next we have ([0-9]{4}), which simply matches the 4 digits you are looking for. These can be found in group 2. I use [0-9] here since \d in .NET matches more digits than the ones we are used to (like Eastern Arabic numerals).
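A brief sketch applying this pattern to the example data, followed by a check of the \d versus [0-9] point. The Parse helper is hypothetical, and U+0661 (ARABIC-INDIC DIGIT ONE) is chosen here only as a sample non-ASCII digit.

```csharp
using System;
using System.Text.RegularExpressions;

var regex = new Regex(@"([A-Z]{3})\s+(?:[0-9]\s+)?([0-9]{4})");

// Hypothetical helper: returns "CODE TIME" built from groups 1 and 2.
string Parse(string line)
{
    Match m = regex.Match(line);
    return $"{m.Groups[1].Value} {m.Groups[2].Value}";
}

Console.WriteLine(Parse("ORD 1100"));   // ORD 1100
Console.WriteLine(Parse("HOU 1 1215")); // HOU 1215
Console.WriteLine(Parse("MAD 4 1300")); // MAD 1300

// Without RegexOptions.ECMAScript, \d matches any Unicode decimal digit.
bool backslashD = Regex.IsMatch("\u0661", @"\d");     // true
bool asciiClass = Regex.IsMatch("\u0661", @"[0-9]");  // false
Console.WriteLine($"{backslashD} {asciiClass}");
```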
Here's a little something, using lookbehind and lookahead to be sure there are exactly 4 digits, with non-digits (or the beginning/end of the string) surrounding them.
@"(?<=[^\d]|^)\d{4}(?=[^\d]|$)"
The two [^\d] can be replaced with [\s] to match only 4-digit sequences with whitespace around them.
Update:
With your latest update, I merged my regex with yours (from the comment) and came up with this:
@"(?<=[A-Z]{3}\s(\d\s)?)\d{4}(?=\s|$)"
There are three parts to the pattern. First is the lookbehind: (?<=PatternHere). The pattern inside this must occur/match before what we seek.
The next part is our simple main pattern: \d{4}, four digits.
The last part is the lookahead: (?=PatternHere), which is pretty much the same as lookbehind, but checks the other side, forward.
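A minimal sketch of the lookaround version on the example data. Because the airport code sits inside the lookbehind, the match value is only the 4-digit time. The FlightTime helper is a hypothetical name.

```csharp
using System;
using System.Text.RegularExpressions;

// .NET allows variable-length patterns inside a lookbehind, which this relies on.
var regex = new Regex(@"(?<=[A-Z]{3}\s(\d\s)?)\d{4}(?=\s|$)");

// Hypothetical helper: returns the matched time, or "" when nothing matches.
string FlightTime(string line) => regex.Match(line).Value;

Console.WriteLine(FlightTime("ORD 1100"));   // 1100
Console.WriteLine(FlightTime("HOU 1 1215")); // 1215
Console.WriteLine(FlightTime("MAD 4 1300")); // 1300
```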
