How to match any repeated chunks of characters?

I've seen many questions similar to this but none quite like it.
I have strings like this:
HF-01-HF-01-01
FBC-FBC-04
OZYA-03A-OZYA-03A-03
QC-QC-02
and want them to be returned like so:
HF-01-01
FBC-04
OZYA-03A-03
QC-02
I can't figure this out, and the other questions I've seen don't apply because 1) the repeated chunk is more than one character, and 2) there are no spaces between the repetitions.
Or is regex not the best way to do this?
EDIT:
Rules
Alpha chunks are never repeated more than one time.
Some chunks can be alphanumeric but also never repeated more than one
time.
The part that can be repeated would be from the start of the string
and any additional chunks by hyphen.
So you would never have something like HF-HF-01-01. But in this case using the above rules, it would become HF-01-01 since HF is the only part repeated from the beginning of the string.
Perhaps something like this would work:
Scan the string up to the first hyphen and check whether that prefix occurs anywhere else after the first hyphen. If it does, scan up to the second hyphen and check again, then the third, and so on. Once a longer prefix no longer repeats, take the last prefix that did repeat and remove one instance of it from the string.
But I don't know how to do that in regex.

I'm not sure if RegExp is the right tool here.
Using the MoreLinq RunLengthEncode method (which implements run-length encoding), you can achieve it like this:
string RemoveDuplicate(string input)
{
    var chunks = input.Split('-')       // cut at -
        .RunLengthEncode()              // group and count adjacent equal chunks
        .Select(kvp => kvp.Key);        // just take the chunk value
    return string.Join("-", chunks);    // reglue with -
}
Edit
Doesn't work for:
OZYA-03A-OZYA-03A-03
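A rough sketch without regex or MoreLinq, assuming the repeated prefix is always immediately followed by its copy (as in all four sample strings from the question). `ChunkDeduplicator` and `RemoveLeadingRepeat` are names made up for this illustration:

```csharp
using System;
using System.Linq;

public static class ChunkDeduplicator
{
    // Hypothetical helper, not part of any library.
    // Assumption: the duplicated prefix is directly followed by its copy.
    public static string RemoveLeadingRepeat(string input)
    {
        string[] chunks = input.Split('-');

        // Try the longest candidate prefix first, then shorter ones.
        for (int k = chunks.Length / 2; k >= 1; k--)
        {
            bool repeated = chunks.Take(k).SequenceEqual(chunks.Skip(k).Take(k));
            if (repeated)
            {
                // Drop the first copy of the repeated prefix.
                return string.Join("-", chunks.Skip(k));
            }
        }
        return input; // no repeated prefix found
    }

    public static void Main()
    {
        string[] samples = { "HF-01-HF-01-01", "FBC-FBC-04", "OZYA-03A-OZYA-03A-03", "QC-QC-02" };
        foreach (string s in samples)
        {
            Console.WriteLine(RemoveLeadingRepeat(s));
        }
        // Prints:
        // HF-01-01
        // FBC-04
        // OZYA-03A-03
        // QC-02
    }
}
```

Comparing whole chunks rather than raw characters keeps a prefix like HF from accidentally matching inside a longer chunk.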

I guess,
([^-\r\n]+-|[^-\r\n]+-[^-\r\n]+-)(\1.*)
or with start/end anchors,
^([^-\r\n]+-|[^-\r\n]+-[^-\r\n]+-)(\1.*)$
might work to some extent and the desired output is in the last capturing group:
(\1.*)
Test
using System;
using System.Text.RegularExpressions;
public class Example
{
    public static void Main()
    {
        string pattern = @"([^-\r\n]+-|[^-\r\n]+-[^-\r\n]+-)(\1.*)";
        string input = @"HF-01-HF-01-01
FBC-FBC-04
OZYA-03A-OZYA-03A-03
QC-QC-02
and want them to be returned like so:
HF-01-01
FBC-04
OZYA-03A-03
QC-02";
        RegexOptions options = RegexOptions.Multiline;
        foreach (Match m in Regex.Matches(input, pattern, options))
        {
            Console.WriteLine("'{0}' found at index {1}.", m.Value, m.Index);
        }
    }
}
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.

I'm not sure if regex is the right tool here, but at least it can be somewhat done with this short pattern:
^([A-Z0-9]+)-.*(\1.*)$
Explanation:
^ start of string
( group 1 start
[A-Z0-9]+ one or more capital letters or digits
) end group 1
- literal
.* any number of any chars
( group 2 start
\1 anything that was matched in group 1
.* any number of any chars
) end group 2 (this group will be used as the result)
$ end of string
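A minimal sketch of how this pattern could be used from C#, reading the result from group 2. The class and method names are made up for this illustration, and falling back to the unchanged input when nothing matches is an assumption:

```csharp
using System;
using System.Text.RegularExpressions;

public static class BackreferenceDemo
{
    // Hypothetical helper: returns group 2 of the pattern above,
    // or the unchanged input when the pattern does not match.
    public static string RemoveRepeat(string input)
    {
        Match m = Regex.Match(input, @"^([A-Z0-9]+)-.*(\1.*)$");
        return m.Success ? m.Groups[2].Value : input;
    }

    public static void Main()
    {
        string[] samples = { "HF-01-HF-01-01", "FBC-FBC-04", "OZYA-03A-OZYA-03A-03", "QC-QC-02" };
        foreach (string s in samples)
        {
            Console.WriteLine(RemoveRepeat(s));
        }
        // Prints:
        // HF-01-01
        // FBC-04
        // OZYA-03A-03
        // QC-02
    }
}
```

Because the middle .* is greedy, group 2 starts at the last place where the first chunk reappears, which is what the sample strings need.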

Related

C# Regex to obtain string up until a pattern

I've always been really bad when it comes to using regular expressions but it is something I want to seriously understand because as we all know, it is quite useful.
This is for a personal project, to keep my folders organized and neat.
I have a bunch of folders with the following naming pattern XXXXXXXX.XXXXXXX.XXXXXX.SYY.EYY.SOMETHINGELSE
There can be any amount of X repeating separated by ".", but the SYY.EYY is always there. So what I want is a regular expression to retrieve all the text represented by XXX without the "." if possible up until the SYY.EYY pattern.
I managed to detect the pattern because YY are always numbers, so something like \d{2} will detect it, but I'm wondering if it's possible to also add the rest of the pattern to that \d{2}.
Any help is appreciated :)

If the YY is as you stated 2 digits and you want to get the text except the . up until for example S11.E22 you could make use of the \G anchor and a capturing group to get the text without a dot.
The value is in the Match.Groups property.
\G(?!S[0-9]{2}\.E[0-9]{2})([^.]+)\.
In parts
\G Assert position at the end of previous match (start at the beginning)
(?! Negative lookahead, assert what is directly to the right is not
S[0-9]{2}\.E[0-9]{2} Match S, 2 digits, . E and 2 digits
) Close lookahead
( Capture group 1
[^.]+ Match 1+ times any char except a dot
) Close group 1
\. Match dot literal
For example
string pattern = @"\G(?!S[0-9]{2}\.E[0-9]{2})([^.]+)\.";
string input = @"XXXXXXXX.XXXXXXX.XXXXXX.S11.E22.SOMETHINGELSE";
foreach (Match m in Regex.Matches(input, pattern))
{
    Console.WriteLine(m.Groups[1].Value);
}
Output
XXXXXXXX
XXXXXXX
XXXXXX

You can "replace/cut" the "." with C#.
The regex to get up until the SYY.EYY can be like this:
.SYY.EYY$
Line ends with word -> Regex: ExampleWord$

I would do something like:
var leftPart = Regex.Match(x, "^.*?(?=SYY)").Captures.First().Value;
// this now has XXXXXXXX.XXXXXXX.XXXXXX.
// And we can:
var left = leftPart.Replace(".", " "); // or any other char
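The SYY in the lookahead above reads as a placeholder for the real season marker. A sketch that spells out the two-digit parts with \d{2}, assuming folder names shaped like the one in the question (`FolderNameDemo` and `TitlePart` are made-up names):

```csharp
using System;
using System.Text.RegularExpressions;

public static class FolderNameDemo
{
    // Hypothetical helper: returns the text before ".SYY.EYY" with the dots
    // replaced by spaces, or the unchanged name if the marker is not found.
    public static string TitlePart(string folderName)
    {
        Match m = Regex.Match(folderName, @"^(.*?)\.S\d{2}\.E\d{2}");
        return m.Success ? m.Groups[1].Value.Replace(".", " ") : folderName;
    }

    public static void Main()
    {
        Console.WriteLine(TitlePart("XXXXXXXX.XXXXXXX.XXXXXX.S11.E22.SOMETHINGELSE"));
        // Prints: XXXXXXXX XXXXXXX XXXXXX
    }
}
```

The lazy .*? stops at the first place where the season/episode marker follows, so nothing after the marker ends up in the captured group.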

Regular expression only one character and 7 numbers

I want a regex that matches exactly one letter, in any position of the word, plus 7 digits.
Examples that should match:
1111111q
2222222q
111e1111
11e11111
I wrote this pattern, but it does not work for all of these cases:
[A-Za-z][0-9]{7}

Regular expressions match patterns. In your case, it would seem that the letter can be at any point in your string, which would mean that you would have a multitude of patterns which would need to be taken into consideration.
I think that for this case, you should not use regular expressions, for simplicity's sake. I would recommend you take a look at the Char.IsDigit(Char c) and Char.IsLetter(Char c) methods and use counters to check that the string is in the format you are after.

There are readily available methods in C# for checking the conditions you want. I would use Regex only if there is no parser or simple C# solution.
I would do it like this:
var str = "1111111u";
var isValid = str.Length == 8 &&
    str.Where(char.IsDigit).Count() == 7 &&
    str.Where(char.IsLetter).Count() == 1;

It is not that difficult in regex:
If the complete string has to match just use:
^(?=.{8}$)\d*[a-zA-Z]\d*$
If this is a word in a larger text use:
\b(?=[a-z0-9]{8}\b)\d*[a-z]\d*\b
\d*[a-z]\d* matches any amount of digits, followed by one letter, then again any amount of digits.
(?=[a-z0-9]{8}\b) is a positive lookahead assertion; it ensures a total length of 8.
Important here is the use of anchors or word boundaries to avoid partial wrong matches.
If you really want to match any letter then use the Unicode property \p{L} instead of the character class:
^(?=.{8}$)\d*\p{L}\d*$
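A small sketch showing the whole-string variant in C#. The class and method names are made up for this illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class OneLetterSevenDigits
{
    // The lookahead fixes the total length at 8; the rest allows
    // exactly one ASCII letter surrounded by digits.
    private static readonly Regex Pattern = new Regex(@"^(?=.{8}$)\d*[a-zA-Z]\d*$");

    public static bool IsValid(string s)
    {
        return Pattern.IsMatch(s);
    }

    public static void Main()
    {
        Console.WriteLine(IsValid("1111111q")); // True
        Console.WriteLine(IsValid("111e1111")); // True
        Console.WriteLine(IsValid("12345678")); // False: no letter
        Console.WriteLine(IsValid("11ee1111")); // False: two letters
        Console.WriteLine(IsValid("111e111"));  // False: only 7 characters
    }
}
```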

I can only come up with a "brute force" regex method:
foundMatch = Regex.IsMatch(subjectString,
    @"\b
    (?:[a-z]\d{7}|
    \d[a-z]\d{6}|
    \d{2}[a-z]\d{5}|
    \d{3}[a-z]\d{4}|
    \d{4}[a-z]\d{3}|
    \d{5}[a-z]\d{2}|
    \d{6}[a-z]\d{1}|
    \d{7}[a-z])
    \b",
    RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);
Note the word boundary anchors, which you should remove if this pattern is part of a longer string.
Also note the IgnoreCase option, which you can remove if all letters will be lower case.
Edit: See @stema's answer for a much more concise regex.

This will match what you want:
(\d{1}\w\d{6}|\d{2}\w\d{5}|\d{3}\w\d{4}|\d{4}\w\d{3}|\d{5}\w\d{2}|\d{6}\w\d{1}|\d{7}\w)
I generated it like this, in powershell:
$n = 6;
for ($i = 1; $i -le 6; $i++) {
write-host "\d{"$i"}\w\d{"$n"}"
$n--
}

Your example will only work when the character is the first character in the string.
The problem you've got is that you need a total of 7 digits, and absolutely only one character potentially within those 7 digits. This is not something that's possible with regular expressions as defined in theory, because you have to have a link between the two groups of digits to see how many are in the other group and regexes can't carry that kind of context around with them.
I was wondering if it was possible using a lookahead assertion to ensure there's only one letter, but the best I can do is ensuring there's no instance of two letters in a row, which doesn't cover all possible invalid cases. Thus I think you're going to have to find another method, as npinti suggested. So something like:
public static bool Match(string s) {
return (s.Length == 8) &&
(s.Where(Char.IsDigit).Count() == 7) &&
(s.Where(Char.IsLetter).Count() == 1);
}
But I haven't tested that.

Just use this if you want one letter and 7 digits:
"[A-Za-z]{1}[0-9]{7}|[0-9]{7}[A-Za-z]{1}|[0-9]{1}[A-Za-z]{1}[0-9]{6}|[0-9]{2}[A-Za-z]{1}[0-9]{5}|[0-9]{3}[A-Za-z]{1}[0-9]{4}|[0-9]{4}[A-Za-z]{1}[0-9]{3}|[0-9]{5}[A-Za-z]{1}[0-9]{2}|[0-9]{6}[A-Za-z]{1}[0-9]{1}"
Here is a code snippet showing how you can iterate through the results:
string st = "1111111q 2222222q 111e1111 11e11111";
string pattS = @"[A-Za-z]{1}[0-9]{7}|[0-9]{7}[A-Za-z]{1}|[0-9]{1}[A-Za-z]{1}[0-9]{6}|[0-9]{2}[A-Za-z]{1}[0-9]{5}|[0-9]{3}[A-Za-z]{1}[0-9]{4}|[0-9]{4}[A-Za-z]{1}[0-9]{3}|[0-9]{5}[A-Za-z]{1}[0-9]{2}|[0-9]{6}[A-Za-z]{1}[0-9]{1}";
Regex regex = new Regex(pattS);
var res = regex.Matches(st);
foreach (Match re in res)
{
    Console.WriteLine(re.Value);
}
It covers all the examples you provided.

You can use this pattern:
^([0-9])(?:\1|[a-z](?!.*[a-z])){7}|[a-z]([0-9])\2{6}$

With Regex, you can do it in two steps. First you can remove the character, in whatever position it is:
string input = "111a1111";
Regex rgx = new Regex(@"[a-zA-Z]");
string output = rgx.Replace(input, "", 1); // remove only one character
// output = "1111111"
then you can match with [0-9]{7} (if you don't want all digits to be the same)
or with ^(\d)\1{6}$ (if you want 7 occurrences of the same digit)

C# Regular expressions, retrieving two words separated by a comma, parenthesis operator

I've been playing around with retrieving data from a string using regular expression, mostly as an exercise for myself. The pattern that I'm trying to match looks like this:
"(SomeWord,OtherWord)"
After reading some documentation and looking at a cheat sheet I came to the conclusion that the following regex should give me 2 matches:
"\((\w),(\w)\)"
Because according to the documentation the parenthesis should do the following:
(pattern) Matches pattern and remembers the match. The matched
substring can be retrieved from the resulting Matches collection,
using Item [0]...[n]. To match parentheses characters ( ), use "\ (" or
"\ )".
However using the following code (removed error checking for conciseness) matches quite something different:
string line = "(A,B)";
string pattern = @"\((\w),(\w)\)";
MatchCollection matches = Regex.Matches(line, pattern);
string left = matches[0].Value;
string right = matches[1].Value;
Now I would expect left to become "A" and right to become "B". However left becomes "(A,B)" and there is no second match at all. What am I missing here?
(I know this example is trivial to solve without regexes but to learn how to properly use regexes I should be able to make something simple as this work)

You want the Groups member of the first match. In your example there is only 1 match, which is the whole string. In the Groups collection you will have 3 items. Try this sample code: left should be A, and right should be B. If you look at the groups[0] value, it will be the whole string.
string line = "(A,B)";
string pattern = @"\((\w),(\w)\)";
MatchCollection matches = Regex.Matches(line, pattern);
GroupCollection groups = matches[0].Groups;
string left = groups[1].Value;
string right = groups[2].Value;

\w matches only one word character. If words have to contain at least one character, the expression should be:
string pattern = @"\((\w+),(\w+)\)";
If words may be empty:
string pattern = @"\((\w*),(\w*)\)";
+: means one or more repetitions.
*: means zero, one or more repetitions.
In any case, you will get one match with three groups, the first containing the whole string including the left and right parentheses, the two others the two words.

I think the problem is that you're confusing the concept of a match and a group.
A MatchCollection contains a list of strings that matched your entire regex, not just the parenthetical groups inside that Regex. For example, if the string you searched looked like this...
(A,B)(C,D)
...then you would have two matches: (A,B) and (C,D).
However, there's good news: you can get the groups from each match very easily, like so:
string line = "(A,B)";
string pattern = @"\((\w),(\w)\)";
MatchCollection matches = Regex.Matches(line, pattern);
string left = matches[0].Groups[1].Value;
string right = matches[0].Groups[2].Value;
That Groups variable is a collection of parenthetical groups from a single match.
Edit:
Olivier Jacot-Descombes made a very good point: we all got so hung up explaining match vs. group that we forgot to notice a second problem: \w will only match a SINGLE character. You need to add a quantifier (such as +) in order to grab more than one character at a time. Olivier's answer should explain that part clearly.
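To make the match/group distinction concrete, here is a small sketch that walks every match in a string with two parenthesised pairs and reads both groups from each, using \w+ so that longer words are captured. The class and method names are made up for this illustration:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class MatchVsGroupDemo
{
    // Hypothetical helper: one output line per match, built from its groups.
    public static string[] Describe(string line)
    {
        return Regex.Matches(line, @"\((\w+),(\w+)\)")
                    .Cast<Match>()
                    .Select(m => string.Format("{0}: left={1}, right={2}",
                                               m.Value, m.Groups[1].Value, m.Groups[2].Value))
                    .ToArray();
    }

    public static void Main()
    {
        foreach (string s in Describe("(A,B)(Some,Other)"))
        {
            Console.WriteLine(s);
        }
        // Prints:
        // (A,B): left=A, right=B
        // (Some,Other): left=Some, right=Other
    }
}
```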

First off, it's one "match", with 2 "groups"...
I would recommend you name the groups anyway...
string pattern = @"\((?<FirstWord>\w+),(?<SecondWord>\w+)\)";
Then you could do...
Match m = Regex.Match(line, pattern);
string firstWord = m.Groups["FirstWord"].Value;

Since all you are looking for are the characters separated by a comma, you can simply use \w as your pattern. The matches will be A and B.
A handy site for testing your Regex is http://gskinner.com/RegExr/

.NET REGEX Matching matches empty strings

I have this
pattern:
[0-9]*\.?[0-9]*
Target:
X=113.3413475 Y=18.2054775
And I want to match the numbers. It matches fine in testing software like http://regexpal.com/ and Regex Coach.
But in Dot net and http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx
I get:
Found 11 matches:
1.
2.
3.
4.
5.
6. 113.3413475
7.
8.
9.
10. 18.2054775
11.
String literals for use in programs:
C#
@"[0-9]*[\.]?[0-9]*"
Does anyone have any idea why I'm getting all these empty matches?
Thanks and Regards,
Kevin

Yes, that will match empty string. Look at it:
[0-9]* - zero or more digits
\.? - an optional period
[0-9]* - zero or more digits
Everything's optional, so an empty string matches.
It sounds like you always want there to be digits somewhere, for example:
[0-9]+\.[0-9]*|\.[0-9]+|[0-9]+
(The order here matters, as you want it to take the most possible.)
That works for me:
using System;
using System.Text.RegularExpressions;
class Test
{
    static void Main(string[] args)
    {
        string x = "X=113.3413475 Y=18.2054775";
        Regex regex = new Regex(@"[0-9]+\.[0-9]*|\.[0-9]+|[0-9]+");
        var matches = regex.Matches(x);
        foreach (Match match in matches)
        {
            Console.WriteLine(match);
        }
    }
}
Output:
113.3413475
18.2054775
There may well be better ways of doing it, admittedly :)

Try this one:
[0-9]+(\.[0-9]+)?
It's slightly different than Jon Skeet's answer in that it won't match .45; it requires either a number alone (e.g. 8) or a real decimal (e.g. 8.1 or 0.1).

Another alternative is to keep your original regex, and just assert it must have a number in it (maybe after a dot):
[0-9]*\.?[0-9]*
Goes to:
(?=\.?[0-9])[0-9]*\.?[0-9]*
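A short sketch of the lookahead version in C#, run against the target string from the question. The class and method names are made up for this illustration:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class NumberExtractor
{
    // The lookahead rejects every position that is not followed by a digit
    // (optionally after a dot), so empty matches can no longer occur.
    public static string[] Extract(string text)
    {
        return Regex.Matches(text, @"(?=\.?[0-9])[0-9]*\.?[0-9]*")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        foreach (string number in Extract("X=113.3413475 Y=18.2054775"))
        {
            Console.WriteLine(number);
        }
        // Prints:
        // 113.3413475
        // 18.2054775
    }
}
```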

The key problem is the *, which means "match zero or more of the preceding characters". The empty string matches zero or more digits, which is why you're getting all those matches.
Change your two *s to +s and you'll get what you want.

The problem with this regex is that it is completely optional in all the fields, so an empty string also is matched by it. I would consider adding all the cases. By the regex, I see you want the numbers with or without dot, and with or without a set of decimal digits. You can separate first those that contain only numbers [0-9]+, then those that contain numbers plus only a dot, [0-9]+\. and then join them all with | (or).
The problem with the regex as it is is that it allows cases that are not real numbers, for example, the cases in which the first set of numbers and the last set of numbers are empty (just a dot), so you have to put the valid cases explicitly.

Regex pattern = new Regex(@"[0-9]+[\.][0-9]+");
string info = "X=113.3413475 Y=18.2054775";
MatchCollection matches = pattern.Matches(info);
int count = 1;
foreach (Match match in matches)
{
    Console.WriteLine("{0} : {1}", count++, match.Value);
}
//output
//1 : 113.3413475
//2 : 18.2054775
Replace your * with + and remove ? from your period case.
EDIT: from the conversation above, @"[0-9]+\.[0-9]*|\.[0-9]+|[0-9]+" is the better pattern. It catches 123, .123, 123.123, etc.

regex for capturing digits and digit ranges

i have the following string
Fat mass loss was 2121,323.222 greater for GPLC (2–2.4kg vs. 0.5kg)
i want to capture
212,323.222
2-2.24
0.5
That is, I want the above three results from the string.
Can anyone help me with this regex?

I noticed that the hyphen in 2–2.4kg is not really a hyphen; it's the Unicode character U+2013 (EN DASH).
So, here is another regex in C#:
@"[0-9]+([,.\u2013-][0-9]+)*"
Test
MatchCollection matches = Regex.Matches("Fat mass loss was 2121,323.222 greater for GPLC (2–2.4kg vs. 0.5kg)", @"[0-9]+([,.\u2013-][0-9]+)*");
foreach (Match m in matches) {
    Console.WriteLine(m.Groups[0]);
}
Here are the results. My console does not support printing the Unicode character U+2013, so it shows "?", but it is matched properly.
2121,323.222
2?2.4
0.5

Okay I didn't notice the C# tag until now. I will leave the answer but I know that's not what you expected, see if you can do something with it. Perhaps the title should have mentioned the programming language?
Sure:
Fat mass loss was (.*) greater for GPLC \((.*) vs. (.*)kg\)
Find your substrings in \1, \2 and \3.
If for Emacs, swap all parentheses and escaped parentheses.

How about something like this:
^.*((?:\d+,)*\d+(?:\.\d+)?).*(\d+(?:\.\d+)?(?:-\d+(?:\.\d+))?).*(\d+(?:\.\d+)).*$
A little more general, I think. I'm a little concerned about .* being greedy.

Fat mass loss was 2121,323.222 greater
for GPLC (2–2.4kg vs. 0.5kg)
a generalized extractor:
/\D+?([\d\,\.\-]+)/g
explanation:
/ # start pattern
\D+ # 1 or more non-digits
( # capture group 1
[\d,.-]+ # character class, 1 or more of digits, comma, period, hyphen
) # end capture group 1
/g # trailing regex g modifier (make regex continue after last match)
sorry I don't know c# well enough for a full writeup, but the pattern should plug right in.
see: http://www.radsoftware.com.au/articles/regexsyntaxadvanced.aspx for some implementation examples.

I came out with something like this atrocity:
-?\d(?:,?\d)*(?:\.(?:\d(?:,?\d)*\d|\d))?(?:[–-]-?\d(?:,?\d)*(?:\.(?:\d(?:,?\d)*\d|\d))?)?
Out of which -?\d(?:,?\d)*(?:\.(?:\d(?:,?\d)*\d|\d))? is repeated twice, with – in the middle (note that this is a long hyphen).
This should take care of dots and commas outside of numbers, eg: hello,23,45.2-7world - will capture 23,45.2-7.

It looks like you're trying to find all numbers in the string (possibly with commas inside the number), and all ranges of numbers such as "2-2.4". Here is a regex that should work:
\d+(?:[,.-]\d+)*
From C# 3, you can use it like this:
var input = "Fat mass loss was 2121,323.222 greater for GPLC (2-2.4kg vs. 0.5kg)";
var pattern = @"\d+(?:[,.-]\d+)*";
var matches = Regex.Matches(input, pattern);
foreach (Match match in matches)
    Console.WriteLine(match.Value);

Hmm, this is a tricky question, especially because the input string contains unicode character – (EN DASH) instead of - (HYPHEN-MINUS). Therefore the correct regex to match the numbers in the original string would be:
\d+(?:[\u2013,.]\d+)*
A more generic approach would be:
\d+(?:[\p{Pd}\p{Pc}\p{Po}]\d+)*
which matches dash punctuation, connector punctuation and other punctuation.
An implementation in C# would look like this:
string input = "Fat mass loss was 2121,323.222 greater for GPLC (2–2.4kg vs. 0.5kg)";
try {
    Regex rx = new Regex(@"\d+(?:[\p{Pd}\p{Pc}\p{Po}\p{C}]\d+)*", RegexOptions.IgnoreCase | RegexOptions.Multiline);
    Match match = rx.Match(input);
    while (match.Success) {
        // matched text: match.Value
        // match start: match.Index
        // match length: match.Length
        match = match.NextMatch();
    }
} catch (ArgumentException ex) {
    // Syntax error in the regular expression
}

Let's try this one :
(?=\d)([0-9,.-]+)(?<=\d)
It captures all expressions containing only :
"[0-9,.-]" characters,
must start with a digit "(?=\d)",
must finish with a digit "(?<=\d)"
It works with a single digit expression and does not include beginning or trailing [.,-].
Hope this helps.
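A sketch of this pattern in C#. Note one change that is an assumption on my part: \u2013 (EN DASH) is added to the character class, because the sample string uses that character rather than a plain hyphen. The class and method names are made up for this illustration:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class RangeExtractor
{
    // Same idea as the pattern above, with \u2013 (EN DASH) added to the class.
    // The lookarounds force each match to start and end with a digit.
    public static string[] Extract(string text)
    {
        return Regex.Matches(text, @"(?=\d)[0-9,.\u2013-]+(?<=\d)")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        // "\u2013" is written as an escape so the source stays ASCII-only.
        string input = "Fat mass loss was 2121,323.222 greater for GPLC (2\u20132.4kg vs. 0.5kg)";
        foreach (string value in Extract(input))
        {
            Console.WriteLine(value);
        }
        // Finds three values: 2121,323.222, then 2, EN DASH, 2.4, then 0.5
    }
}
```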

I got the solution to my problem.
The following is the Regex that gave my desired result:
(([0-9]+)([–.,-]*))+
