C# Regex to obtain string up until a pattern


I've always been really bad when it comes to using regular expressions but it is something I want to seriously understand because as we all know, it is quite useful.
This is for a personal project, to keep my folders organized and neat.
I have a bunch of folders with the following naming pattern XXXXXXXX.XXXXXXX.XXXXXX.SYY.EYY.SOMETHINGELSE
There can be any number of X segments separated by ".", but the SYY.EYY part is always there. So what I want is a regular expression that retrieves all the text represented by XXX, without the "." if possible, up until the SYY.EYY pattern.
I managed to detect the pattern because YY are always digits, so something like \d{2} will find it, but I'm wondering if it's possible to also add the rest of the pattern to that \d{2}.
Any help is appreciated :)

If YY is always two digits, as you stated, and you want the text without the dots up until, for example, S11.E22, you could use the \G anchor and a capturing group that excludes the dot.
The value of each part is in group 1, available through the Match.Groups property.
\G(?!S[0-9]{2}\.E[0-9]{2})([^.]+)\.
In parts
\G Assert position at the end of the previous match (or at the start of the string for the first match)
(?! Negative lookahead, assert what is directly to the right is not
S[0-9]{2}\.E[0-9]{2} Match S, 2 digits, a dot, E and 2 digits
) Close lookahead
( Capture group 1
[^.]+ Match 1+ times any char except a dot
) Close group 1
\. Match dot literal
For example
string pattern = @"\G(?!S[0-9]{2}\.E[0-9]{2})([^.]+)\.";
string input = @"XXXXXXXX.XXXXXXX.XXXXXX.S11.E22.SOMETHINGELSE";
foreach (Match m in Regex.Matches(input, pattern))
{
    Console.WriteLine(m.Groups[1].Value);
}
Output
XXXXXXXX
XXXXXXX
XXXXXX
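If the goal is a single string rather than one line per part, the captured groups can be joined afterwards. The following is only a sketch: the class and method names are made up for illustration, and joining with a space is an assumption about the desired output.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class FolderNameDemo
{
    // Hypothetical helper: returns the dot-separated parts before SYY.EYY,
    // joined with a space (the separator is an assumption).
    public static string TitleBeforeEpisode(string folderName)
    {
        var parts = Regex.Matches(folderName, @"\G(?!S[0-9]{2}\.E[0-9]{2})([^.]+)\.")
                         .Cast<Match>()
                         .Select(m => m.Groups[1].Value);
        return string.Join(" ", parts);
    }

    public static void Main()
    {
        // prints "XXXXXXXX XXXXXXX XXXXXX"
        Console.WriteLine(TitleBeforeEpisode("XXXXXXXX.XXXXXXX.XXXXXX.S11.E22.SOMETHINGELSE"));
    }
}
```

Because \G only allows a match where the previous one ended, matching stops at the first part that starts with SYY.EYY and nothing after it is picked up.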

You can "replace/cut" the "." with C#.
The regex to get up until the SYY.EYY can be like this:
.SYY.EYY$
Line ends with word -> Regex: ExampleWord$

I would do something like:
var leftPart = Regex.Match(x, @"^.*?(?=S\d{2}\.E\d{2})").Value;
// this now has XXXXXXXX.XXXXXXX.XXXXXX.
// And we can:
var left = leftPart.Replace(".", " "); // or any other char

Related

How to match any repeated chunks of characters?

I've seen many questions similar to this but none quite like it.
I have strings like this:
HF-01-HF-01-01
FBC-FBC-04
OZYA-03A-OZYA-03A-03
QC-QC-02
and want them to be returned like so:
HF-01-01
FBC-04
OZYA-03A-03
QC-02
I can't figure this out and the other questions I've seen don't apply because 1) the repeated chunk is more than one character, 2) There are no spaces between the repetition.
Or is regex not the best way to do this?
EDIT:
Rules
Alpha chunks are never repeated more than one time.
Some chunks can be alphanumeric but also never repeated more than one
time.
The part that can be repeated would be from the start of the string
and any additional chunks by hyphen.
So you would never have something like HF-HF-01-01. But in this case using the above rules, it would become HF-01-01 since HF is the only part repeated from the beginning of the string.
Perhaps something like this would work:
Scan string to first hyphen, see if that matches anywhere else after first hyphen, if so scan to second hyphen, see if that matches anywhere else, if not, take the first scan and remove one instance of it from the string, if so, scan to third, etc.
But I don't know how to do that in regex.

I'm not sure if RegExp is the right tool here.
Using the RunLengthEncode method from MoreLinq (an implementation of run-length encoding), you can do it like this:
string RemoveDuplicate(string input)
{
    var chunks = input.Split('-')      // cut at -
        .RunLengthEncode()             // group and count adjacent equal chunks
        .Select(kvp => kvp.Key);       // just take the chunk value
    return string.Join("-", chunks);   // reglue with -
}
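Edit
This doesn't work when the repeated part spans more than one chunk, because only adjacent equal chunks are collapsed. For example: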
OZYA-03A-OZYA-03A-03
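A plain-string variant that also handles a multi-chunk prefix could look like the sketch below. It assumes, as in all the examples from the question, that the repeated prefix is immediately followed by its copy; the class and method names are invented for illustration.

```csharp
using System;
using System.Linq;

public static class PrefixDedupe
{
    // Drops one copy of the leading run of chunks when that run is
    // immediately repeated. Longer prefixes are tried first.
    public static string RemoveRepeatedPrefix(string input)
    {
        string[] chunks = input.Split('-');
        for (int k = chunks.Length / 2; k >= 1; k--)
        {
            bool repeated = chunks.Take(k).SequenceEqual(chunks.Skip(k).Take(k));
            if (repeated)
                return string.Join("-", chunks.Skip(k));
        }
        return input; // nothing repeated, return unchanged
    }

    public static void Main()
    {
        Console.WriteLine(RemoveRepeatedPrefix("HF-01-HF-01-01"));       // HF-01-01
        Console.WriteLine(RemoveRepeatedPrefix("FBC-FBC-04"));           // FBC-04
        Console.WriteLine(RemoveRepeatedPrefix("OZYA-03A-OZYA-03A-03")); // OZYA-03A-03
        Console.WriteLine(RemoveRepeatedPrefix("QC-QC-02"));             // QC-02
    }
}
```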

I guess,
([^-\r\n]+-|[^-\r\n]+-[^-\r\n]+-)(\1.*)
or with start/end anchors,
^([^-\r\n]+-|[^-\r\n]+-[^-\r\n]+-)(\1.*)$
might work to some extent and the desired output is in the last capturing group:
(\1.*)
Test
using System;
using System.Text.RegularExpressions;
public class Example
{
    public static void Main()
    {
        string pattern = @"([^-\r\n]+-|[^-\r\n]+-[^-\r\n]+-)(\1.*)";
        string input = @"HF-01-HF-01-01
FBC-FBC-04
OZYA-03A-OZYA-03A-03
QC-QC-02
and want them to be returned like so:
HF-01-01
FBC-04
OZYA-03A-03
QC-02";
        RegexOptions options = RegexOptions.Multiline;
        foreach (Match m in Regex.Matches(input, pattern, options))
        {
            Console.WriteLine("'{0}' found at index {1}.", m.Value, m.Index);
        }
    }
}
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.

I'm not sure if regex is the right tool here, but at least it can be done to some extent with this short pattern:
^([A-Z0-9]+)-.*(\1.*)$
Explanation:
^ start of string
( group 1 start
[A-Z0-9]+ one or more capital letters or digits
) end group 1
- literal
.* any number of any chars
( group 2 start
\1 anything that was matched in group 1
.* any number of any chars
) end group 2 (this group will be used as the result)
$ end of string
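For reference, this is roughly how the pattern could be applied in C#. It is a sketch only: the wrapper class and the choice to return the input unchanged when nothing matches are assumptions, not part of the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RepeatedPrefixDemo
{
    // Returns group 2 of the pattern, or the input itself if there is no match.
    public static string Dedupe(string input)
    {
        Match m = Regex.Match(input, @"^([A-Z0-9]+)-.*(\1.*)$");
        return m.Success ? m.Groups[2].Value : input;
    }

    public static void Main()
    {
        Console.WriteLine(Dedupe("HF-01-HF-01-01"));       // HF-01-01
        Console.WriteLine(Dedupe("FBC-FBC-04"));           // FBC-04
        Console.WriteLine(Dedupe("OZYA-03A-OZYA-03A-03")); // OZYA-03A-03
        Console.WriteLine(Dedupe("QC-QC-02"));             // QC-02
    }
}
```

The greedy .* in the middle means the last occurrence of the first chunk is the one that starts group 2, which is what makes the four sample strings come out right.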

.NET Regex for parsing chess moves

Background
I'd like to parse quite a few of strings representing chess moves:
1.e4e62.d3d53.Nd2c54.g3Nf6
Each move begins with an increasing number 1., 2., 3. etc. There are no spaces in-between the moves.
The perfect match would be an array like this:
["1.e4e6", "2.d3d5", "3.Nd2c5", "4.g3Nf6"]
Regex Question
My regex so far is:
([0-9]\.)(.*?)(?=[0-9]\.)
This works in an online .NET regex tester (Regex Storm), apart from not including the last (4th) move.
How to include the last one too?
C# Question
My code is:
var regex = new Regex(@"([0-9]\.)(.*?)(?=[0-9]\.)");
var match = regex.Match(game);
The match here includes only one entry "1.e4e6" and not three (or four).
How to fix?
Thanks,
pom

It cannot match the last item because the lookahead assertion fails there: no digit and dot follow the last move.
You can add an alternation to the lookahead so that it also accepts the end of the string.
To get all the results instead of only the first one, use Regex.Matches instead of Regex.Match.
([0-9]\.)(.*?)(?=[0-9]\.|$)
For example
string pattern = @"([0-9]\.)(.*?)(?=[0-9]\.|$)";
string input = @"1.e4e62.d3d53.Nd2c54.g3Nf6";
foreach (Match m in Regex.Matches(input, pattern))
{
    Console.WriteLine(m.Value);
}
Note that if you want to get a match only and don't want to match spaces, you can use \S instead of a . and omit the capturing group:
[0-9]\.\S*?(?=[0-9]\.|$)
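One caveat with [0-9]\. is that it only looks at a single digit, so games that reach move 10 would need different handling, and a rank digit directly before a move number (as in e62.) makes \d+\. ambiguous. A possible non-regex alternative is to look for the next expected move number instead. The sketch below assumes the string starts with "1.", that numbers increase by one, and that "." appears only after move numbers; the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class ChessMoveSplitter
{
    public static List<string> Split(string game)
    {
        var moves = new List<string>();
        int start = 0;   // index where the current move begins
        int number = 1;  // number of the current move (assumed to start at 1)
        while (start < game.Length)
        {
            string nextMarker = (number + 1) + ".";
            // Skip the current move's own "N." prefix before searching.
            int searchFrom = start + number.ToString().Length + 1;
            int next = searchFrom <= game.Length
                ? game.IndexOf(nextMarker, searchFrom, StringComparison.Ordinal)
                : -1;
            if (next < 0)
            {
                moves.Add(game.Substring(start)); // last move runs to the end
                break;
            }
            moves.Add(game.Substring(start, next - start));
            start = next;
            number++;
        }
        return moves;
    }

    public static void Main()
    {
        // prints 1.e4e6, 2.d3d5, 3.Nd2c5, 4.g3Nf6 (one per line)
        foreach (string move in Split("1.e4e62.d3d53.Nd2c54.g3Nf6"))
            Console.WriteLine(move);
    }
}
```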

How to match a string between <>?

I tried \w+\:(\w+\-?\.?(\d+)?) but that is not correct
I have following text
<staticText:HelloWorld>_<xmlNode:Node.03>_<date:yyy-MM-dd>_<time:HH-mm-ss-fff>
The end result I want is something like the following
["staticText:HelloWorld", "xmlNode:Node.03","date:yyy-MM-dd","time:HH-mm-ss-fff"]

You could use the following regex.
<(.*?)>
Then have a look at how groups work to retrieve the result.
Regex rx = new Regex("<(.*?)>");
string text = "<staticText:HelloWorld>_<xmlNode:Node.03>_<date:yyy-MM-dd>_<time:HH-mm-ss-fff>";
MatchCollection matches = rx.Matches(text);
Console.WriteLine(matches.Count);
foreach(Match match in matches){
var groups = match.Groups;
Console.WriteLine(groups[1]);
}

This line should be able to match the content:
<(.*?)>
The match itself includes the angle brackets at both ends, which you don't seem to want, but you could remove them afterwards without regex.
You should consider a website like https://regexr.com - it helps exponentially in writing regex by allowing you to paste your cases and see how it works with them.

Matches any string within the <>. Hope this helps.
<(.*?)>

Your pattern does not match the 3rd and the 4th part of the example data because in this part \w+\-?\.?(\d+)? the dash and the digits match only once and are not repeated.
For your example data, you might use a character class [\w.-]+ to match the part after the colon, which makes the match a bit broader:
<(\w+\:[\w.-]+)>
Or, to make it more specific, specify a pattern for either the Node.03 part or the date and time parts (year, month, day, hour, etc.) using a repeated pattern.
<(\w+\:\w+(?:\.\d+|\d+(?:-\d+)+)?)>
Explanation
< Match <
( Capturing group
\w+\:\w+ Match 1+ word chars, : and 1+ word chars
(?: Non capturing group
\.\d+ Match . and 1+ digits
| Or
\d+(?:-\d+)+ Match 1+ digits and repeat 1+ times matching - and 1+ digits
)? Close non capturing group and make it optional
) Close capturing group
> Match >
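To get the array shown in the question, the broader pattern can be combined with LINQ. This is a sketch rather than a definitive implementation; the class and method names are invented for the example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class PlaceholderDemo
{
    // Collects the text between < and > (group 1) for every match.
    public static string[] ExtractPlaceholders(string text)
    {
        return Regex.Matches(text, @"<(\w+:[\w.-]+)>")
                    .Cast<Match>()
                    .Select(m => m.Groups[1].Value)
                    .ToArray();
    }

    public static void Main()
    {
        string text = "<staticText:HelloWorld>_<xmlNode:Node.03>_<date:yyy-MM-dd>_<time:HH-mm-ss-fff>";
        // prints the four entries, one per line, without the angle brackets
        foreach (string item in ExtractPlaceholders(text))
            Console.WriteLine(item);
    }
}
```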

Regular expression - second group of digits

Hi, I would like to pull the second group of digits, the ones after the hyphen (-), from the string below:
D:\data\home\Logs_Audit\VO12_LAB_20140617-000301.txt
I used \d{8} to pull 20140617, but now I want to pull 000301.
EDIT 1:
Now I would like to pull VO12_LAB from the above string. Could you please help me?
I am not good at regular expressions and I didn't find a good tutorial to understand them.
EDIT 2:
I found that something like
\w{2,3}\d{2,3}_\w{2,3}
works for me. Do you think it is accurate enough?

You can use lookahead/lookbehind to find the group based on "anchors", like this:
(?<=[-])\\d+(?=[.]txt)
The groups before and after the \\d+ are non-capturing zero-width "markers", in the sense that they do not consume any characters from the string, only describe character combinations that need to precede and/or follow the text that you would like to match.

You can use a Positive Lookahead for this.
\d+(?=\.)
Explanation: This matches digits (1 or more times) followed by a dot.
\d+ digits (0-9) (1 or more times)
(?= look ahead to see if there is:
\. '.'
) end of look-ahead
Final Solution:
string s = @"D:\data\home\Logs\V_LAB_20140617-000301.txt";
Match m = Regex.Match(s, @"\d+(?=\.)");
if (m.Success) {
    Console.WriteLine(m.Value); //=> "000301"
}

You can use this regex:
(?<=-)(\d+)
The first group will contain the digits.
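Since the question ends up needing three pieces of the file name (VO12_LAB, the date and the time), one pattern with named groups may be easier to maintain than three separate ones. The sketch below assumes the file name always has the shape name_YYYYMMDD-HHMMSS.txt; the group names and the helper are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LogNameParser
{
    // Assumed shape: <name>_<8 digits>-<6 digits>.txt at the end of the path.
    // [^\\/]+ keeps the name from crossing a directory separator.
    private static readonly Regex LogName = new Regex(
        @"(?<name>[^\\/]+)_(?<date>\d{8})-(?<time>\d{6})\.txt$");

    public static bool TryParse(string path, out string name, out string date, out string time)
    {
        Match m = LogName.Match(path);
        name = m.Groups["name"].Value; // empty string when there is no match
        date = m.Groups["date"].Value;
        time = m.Groups["time"].Value;
        return m.Success;
    }

    public static void Main()
    {
        if (TryParse(@"D:\data\home\Logs_Audit\VO12_LAB_20140617-000301.txt",
                     out string name, out string date, out string time))
        {
            Console.WriteLine(name); // VO12_LAB
            Console.WriteLine(date); // 20140617
            Console.WriteLine(time); // 000301
        }
    }
}
```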

regex for capturing digits and digit ranges

I have the following string:
Fat mass loss was 2121,323.222 greater for GPLC (2–2.4kg vs. 0.5kg)
I want to capture:
2121,323.222
2–2.4
0.5
i.e. I want the above three results from the string.
Can anyone help me with this regex?

I noticed that the hyphen in 2–2.4kg is not really a hyphen; it's the Unicode character U+2013 (EN DASH).
So, here is another regex in C#:
@"[0-9]+([,.\u2013-][0-9]+)*"
Test
MatchCollection matches = Regex.Matches("Fat mass loss was 2121,323.222 greater for GPLC (2–2.4kg vs. 0.5kg)", @"[0-9]+([,.\u2013-][0-9]+)*");
foreach (Match m in matches) {
    Console.WriteLine(m.Groups[0]);
}
Here are the results. My console does not support printing the Unicode character U+2013, so it shows as "?", but it is properly matched.
2121,323.222
2?2.4
0.5

Okay I didn't notice the C# tag until now. I will leave the answer but I know that's not what you expected, see if you can do something with it. Perhaps the title should have mentioned the programming language?
Sure:
Fat mass loss was (.*) greater for GPLC \((.*) vs. (.*)kg\)
Find your substrings in \1, \2 and \3.
If for Emacs, swap all parentheses and escaped parentheses.

How about something like this:
^.*((?:\d+,)*\d+(?:\.\d+)?).*(\d+(?:\.\d+)?(?:-\d+(?:\.\d+))?).*(\d+(?:\.\d+)).*$
A little more general, I think. I'm a little concerned about .* being greedy.

Fat mass loss was 2121,323.222 greater
for GPLC (2–2.4kg vs. 0.5kg)
a generalized extractor:
/\D+?([\d\,\.\-]+)/g
explanation:
/ # start pattern
\D+? # 1 or more non-digits, as few as possible
( # capture group 1
[\d,.-]+ # character class, 1 or more of digits, comma, period, hyphen
) # end capture group 1
/g # trailing regex g modifier (make regex continue after last match)
sorry I don't know c# well enough for a full writeup, but the pattern should plug right in.
see: http://www.radsoftware.com.au/articles/regexsyntaxadvanced.aspx for some implementation examples.

I came out with something like this atrocity:
-?\d(?:,?\d)*(?:\.(?:\d(?:,?\d)*\d|\d))?(?:[–-]-?\d(?:,?\d)*(?:\.(?:\d(?:,?\d)*\d|\d))?)?
Of which -?\d(?:,?\d)*(?:\.(?:\d(?:,?\d)*\d|\d))? is repeated twice, with [–-] in the middle (note that – is a long dash).
This should take care of dots and commas outside of numbers, e.g. hello,23,45.2-7world will capture 23,45.2-7.

It looks like you're trying to find all numbers in the string (possibly with commas inside the number), and all ranges of numbers such as "2-2.4". Here is a regex that should work:
\d+(?:[,.-]\d+)*
From C# 3, you can use it like this:
var input = "Fat mass loss was 2121,323.222 greater for GPLC (2-2.4kg vs. 0.5kg)";
var pattern = @"\d+(?:[,.-]\d+)*";
var matches = Regex.Matches(input, pattern);
foreach (Match match in matches)
    Console.WriteLine(match.Value);

Hmm, this is a tricky question, especially because the input string contains unicode character – (EN DASH) instead of - (HYPHEN-MINUS). Therefore the correct regex to match the numbers in the original string would be:
\d+(?:[\u2013,.]\d+)*
A more generic approach would be:
\d+(?:[\p{Pd}\p{Pc}\p{Po}]\d+)*
which matches dash punctuation, connector punctuation and other punctuation. See the Unicode general categories for more information about those.
An implementation in C# would look like this:
string input = "Fat mass loss was 2121,323.222 greater for GPLC (2–2.4kg vs. 0.5kg)";
try {
    Regex rx = new Regex(@"\d+(?:[\p{Pd}\p{Pc}\p{Po}\p{C}]\d+)*", RegexOptions.IgnoreCase | RegexOptions.Multiline);
    Match match = rx.Match(input);
    while (match.Success) {
        // matched text: match.Value
        // match start: match.Index
        // match length: match.Length
        match = match.NextMatch();
    }
} catch (ArgumentException ex) {
    // Syntax error in the regular expression
}

Let's try this one :
(?=\d)([0-9,.-]+)(?<=\d)
It captures all expressions containing only :
"[0-9,.-]" characters,
must start with a digit "(?=\d)",
must finish with a digit "(?<=\d)"
It works with a single digit expression and does not include beginning or trailing [.,-].
Hope this helps.

I got the solution to my problem.
The following is the Regex that gave my desired result:
(([0-9]+)([–.,-]*))+
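One thing to be aware of with this pattern is that the separator class is allowed after the last digit group, so trailing punctuation becomes part of the match. The sketch below compares it with the stricter [0-9]+(?:[,.\u2013-][0-9]+)* form on a made-up sentence; the class name, constants and sample input are all assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NumberPatternComparison
{
    // The pattern from this answer, with the en dash written as \u2013.
    public const string Loose = @"(([0-9]+)([\u2013.,-]*))+";
    // A separator must be followed by more digits to be included.
    public const string Strict = @"[0-9]+(?:[,.\u2013-][0-9]+)*";

    public static List<string> Extract(string input, string pattern)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
            results.Add(m.Value);
        return results;
    }

    public static void Main()
    {
        string input = "Lost 2, then 3.5kg."; // hypothetical sample text
        Console.WriteLine(string.Join(" | ", Extract(input, Loose)));  // 2, | 3.5
        Console.WriteLine(string.Join(" | ", Extract(input, Strict))); // 2 | 3.5
    }
}
```

On the string from the question both patterns give the same three results, so the difference only shows up when a number is directly followed by a comma, period or dash.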
