C# regular expression for finding a certain pattern in a text - c#

I'm trying to write a program that can replace bible verses within a document with any desired translation. This is useful for older books that contain a lot of KJV referenced verses. The most difficult part of the process is coming up with a way to extract the verses within a document.
I find that most books that place bible verses within the text use a structure like "N"(BookName chapter#:verse#s), where N is the verse text, the quotations are literal and the parens are also literal. I've been having problems coming up with a regular expression to match these in a text.
The latest regular expression I'm trying to use is this: \"(.+)\"\s*\(([\w. ]+[0-9\s]+[:][\s0-9\-]+.*)\). I'm having trouble where it won't find all the matches.
Here is the regex101 of it with a sample. https://regex101.com/r/eS5oT8/1
Is there any way to solve this using a regular expression? Any help or suggestions would be greatly appreciated.

It's worth mentioning that the site you were using to test this relies on JavaScript regular expressions, which return only the first match unless the g (global) modifier is set explicitly. C# needs no such flag: Regex.Matches() returns every match.
You can adjust your expression slightly and ensure that you escape your double-quotes properly :
// Updated expression with escaped double-quotes and other minor changes
var regex = new Regex(@"\""([^""]+)\""\s*\(([\w. ]+[\d\s]+[:][\s\d\-]+[^)]*)\)");
And then use the Regex.Matches() method to find all of the matches in your string :
// Find each of the matches and output them
foreach(Match m in regex.Matches(input))
{
// Output each match here (using Console Example)
Console.WriteLine(m.Value);
}
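For reference, here is a minimal self-contained sketch that puts the two snippets above together. The class name VerseFinder, the Find helper and the sample sentence are made up for illustration (the sample is not the text from the regex101 link); only the pattern comes from the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class VerseFinder
{
    // Pattern from the answer: quoted text, optional whitespace,
    // then a parenthesized "Book chapter:verses" reference.
    private static readonly Regex VerseRegex = new Regex(
        @"\""([^""]+)\""\s*\(([\w. ]+[\d\s]+[:][\s\d\-]+[^)]*)\)");

    // Returns one (quote, reference) pair per match found in the input.
    public static List<(string Quote, string Reference)> Find(string input)
    {
        var results = new List<(string Quote, string Reference)>();
        foreach (Match m in VerseRegex.Matches(input))
        {
            results.Add((m.Groups[1].Value, m.Groups[2].Value));
        }
        return results;
    }

    public static void Main()
    {
        // Hypothetical sample text with two quoted verses on one line.
        string input = "He wrote, \"For God so loved the world\" (John 3:16) and later " +
                       "\"Love is patient, love is kind\" (1 Cor. 13:4-7).";
        foreach (var (quote, reference) in Find(input))
        {
            Console.WriteLine($"{reference}: {quote}");
        }
    }
}
```

Group 1 holds the verse text and group 2 the reference, so a replacement step could look up group 2 in the desired translation and substitute group 1.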

Use the "g" modifier.
g modifier: global. All matches (don't return on first match)
See the Regex Demo

You can try the example given on MSDN; here is the link:
https://msdn.microsoft.com/en-us/library/0z2heewz(v=vs.110).aspx
using System;
using System.Text.RegularExpressions;
public class Example
{
public static void Main()
{
string input = "ablaze beagle choral dozen elementary fanatic " +
"glaze hunger inept jazz kitchen lemon minus " +
"night optical pizza quiz restoration stamina " +
"train unrest vertical whiz xray yellow zealous";
string pattern = @"\b\w*z+\w*\b";
Match m = Regex.Match(input, pattern);
while (m.Success) {
Console.WriteLine("'{0}' found at position {1}", m.Value, m.Index);
m = m.NextMatch();
}
}
}
// The example displays the following output:
// 'ablaze' found at position 0
// 'dozen' found at position 21
// 'glaze' found at position 46
// 'jazz' found at position 65
// 'pizza' found at position 104
// 'quiz' found at position 110
// 'whiz' found at position 157
// 'zealous' found at position 174

How about starting with this as a guide:
(?<quote>"".+"") # a series of any characters in quotes
\s + # followed by spaces
\( # followed by a parenthetical expression
(?<book>\d*[a-z.\s] *) # book name (a-z, . or space) optionally preceded by digits. e.g. '1 Cor.'
(?<chapter>\d+) # chapter e.g. the '1' in 1:2
: # colon
(?<verse>\d+) # verse e.g. the '2' in 1:2
\)
Using the options:
RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline | RegexOptions.IgnoreCase
The expression above will give you named captures of every element in the match for easy parsing (e.g., you'll be able to pick out quote, book, chapter and verse) by looking at, e.g., match.Groups["verse"].
Full code:
var input = @"Jesus said, ""'Love your neighbor as yourself.'
There is no commandment greater than these"" (Mark 12:31).";
var bibleQuotesRegex =
@"(?<quote>"".+"") # a series of any characters in quotes
\s + # followed by spaces
\( # followed by a parenthetical expression
(?<book>\d*[a-z.\s] *) # book name (a-z, . or space) optionally preceded by digits. e.g. '1 Cor.'
(?<chapter>\d+) # chapter e.g. the '1' in 1:2
: # colon
(?<verse>\d+) # verse e.g. the '2' in 1:2
\)";
foreach(Match match in Regex.Matches(input, bibleQuotesRegex, RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline | RegexOptions.IgnoreCase))
{
var bibleQuote = new
{
Quote = match.Groups["quote"].Value,
Book = match.Groups["book"].Value,
Chapter = int.Parse(match.Groups["chapter"].Value),
Verse = int.Parse(match.Groups["verse"].Value)
};
//do something with it.
}

After you've added "g", also be careful if there are multiple verses without any '\n' character in between, because "(.*)" will treat them as one long match instead of multiple verses. You will want something like "([^"]*)" to prevent that.
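A small sketch of that difference in C#, assuming two verses sit on the same line. The two patterns here are deliberately simplified stand-ins (a quoted run followed by any parenthesized text), not the full expression from the question, and the names GreedyDemo, Greedy and Negated are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyDemo
{
    // Simplified patterns for illustration only.
    // Greedy:  "(.+)"     can run across several quoted passages.
    // Negated: "([^"]+)"  stops at the first closing quote.
    public const string Greedy = @"""(.+)""\s*\(([^)]*)\)";
    public const string Negated = @"""([^""]+)""\s*\(([^)]*)\)";

    public static int CountMatches(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }

    public static void Main()
    {
        // Hypothetical line containing two verses and no newline between them.
        string line = "\"A\" (John 3:16) and \"B\" (Mark 12:31)";
        Console.WriteLine(CountMatches(line, Greedy));  // one long match
        Console.WriteLine(CountMatches(line, Negated)); // one match per verse
    }
}
```

With the greedy version, the first capture group swallows everything from the first opening quote to the last closing quote on the line, which is why only one match comes back.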

Related

Regex to match string between curly braces (that allows to escape them via 'doubling')

I was using the regex from Extract values within single curly braces:
(?<!{){[^{}]+}(?!})
However, it does not cover use case 3 (see below).
I would like to know if it's possible to define a regular expression that satisfies the use cases below.
Use case 1
Given:
Hola {name}
It should match {name} and capture name
But I would like to be able to escape curly braces when needed by doubling them, like C# does for interpolated strings. So, in a string like
Use case 2
Hola {name}, this will be {{unmatched}}
The {{unmatched}} part should be ignored because it uses them doubled. Notice the {{ and }}.
Use case 3
In the last, most complex case, a text like this:
Buenos {{{dias}}}
The text {dias} should be a match (and capture dias) because the first outer-most doubled curly braces should be interpreted just like another character (they are escaped) so it should match: {{{dias}}}
My ultimate goal is to replace the matches later with another string, like a variable.
EDIT
This 4th use case pretty much summarized the whole requirements:
Given:
Hola {name}, buenos {{{dias}}}
Results in:
Match 1:
Matched text: {name}
Captured text: name
Match 2:
Matched text: {dias}
Captured text: dias
To optionally match the doubled curly braces, you could use a conditional and take the value from capture group 2.
(?<!{)({{)?{([^{}]+)}(?(1)}})(?!})
Explanation
(?<!{) Assert not { directly to the left
({{)? Optionally capture {{ in group 1
{([^{}]+)} Match from { till } without matching { and } in between
(?(1)}}) If clause, if group 1 exists, match }}
(?!}) Assert not } directly to the right
.Net regex demo | C# demo
string pattern = @"(?<!{)({{)?{([^{}]+)}(?(1)}})(?!})";
string input = @"Hola {name}
Hola {name}, this will be {{unmatched}}
Buenos {{{dias}}}";
foreach (Match m in Regex.Matches(input, pattern))
{
Console.WriteLine(m.Groups[2].Value);
}
Output
name
name
dias
If the doubled curly braces should be balanced, you might use this approach:
(?<!{){(?>(?<={){{(?<c>)|([^{}]+)|}}(?=})(?<-c>))*(?(c)(?!))}(?!})
.NET regex demo
You can use
(?<!{)(?:{{)*{([^{}]*)}(?:}})*(?!})
See the .NET regex demo.
In C#, you can use
var results = Regex.Matches(text, @"(?<!{)(?:{{)*{([^{}]*)}(?:}})*(?!})").Cast<Match>().Select(x => x.Groups[1].Value).ToList();
Alternatively, to get full matches, wrap the left- and right-hand contexts in lookarounds:
(?<=(?<!{)(?:{{)*{)[^{}]*(?=}(?:}})*(?!}))
See this regex demo.
In C#:
var results = Regex.Matches(text, @"(?<=(?<!{)(?:{{)*{)[^{}]*(?=}(?:}})*(?!}))")
.Cast<Match>()
.Select(x => x.Value)
.ToList();
Regex details
(?<=(?<!{)(?:{{)*{) - immediately to the left, there must be zero or more {{ substrings not immediately preceded with a { char and then {
[^{}]* - zero or more chars other than { and }
(?=}(?:}})*(?!})) - immediately to the right, there must be }, zero or more }} substrings not immediately followed with a } char.
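Since the question's end goal is replacing the matches with other strings, here is one possible sketch of that step. It uses a variant of the (?<!{)(?:{{)*{([^{}]*)}(?:}})*(?!}) pattern in which the surrounding doubled braces are captured in groups 1 and 3 so they can be copied back unchanged; that variant, the TemplateFiller class and the dictionary lookup are assumptions made for illustration, not part of the answers above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TemplateFiller
{
    // Group 1: leading escaped braces ({{ pairs), group 2: placeholder name,
    // group 3: trailing escaped braces (}} pairs).
    private static readonly Regex Placeholder = new Regex(
        @"(?<!{)((?:{{)*){([^{}]*)}((?:}})*)(?!})");

    public static string Fill(string template, IDictionary<string, string> values)
    {
        return Placeholder.Replace(template, m =>
        {
            string name = m.Groups[2].Value;
            // Assumption: unknown names are left as they are in the template.
            if (!values.TryGetValue(name, out string value))
            {
                return m.Value;
            }
            // Keep the escaped braces around the substituted value.
            return m.Groups[1].Value + value + m.Groups[3].Value;
        });
    }

    public static void Main()
    {
        var values = new Dictionary<string, string> { ["name"] = "Ana", ["dias"] = "tardes" };
        Console.WriteLine(Fill("Hola {name}, buenos {{{dias}}}", values));
    }
}
```

The doubled braces are left doubled in the result; collapsing {{ to { would be a separate pass if the output should be unescaped as well.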

Get the middle part of a filename using regex

I need a regex that can return up to 10 characters in the middle of a file name.
filename: returns:
msl_0123456789_otherstuff.csv -> 0123456789
msl_test.xml -> test
anythingShort.w1 -> anythingSh
I can capture the beginning and end for removal with the following regex:
Regex.Replace(filename, "(^msl_)|([.][[:alnum:]]{1,3}$)", string.Empty); *
but I also need to have only 10 characters when I am done.
Explanation of the regex above:
(^msl_) - match lines that start with "msl_"
| - or
([.] - match a period
[[:alnum:]]{1,3} - followed by 1-3 alphanumeric characters
$) - at the end of the line
Note [[:alnum:]] can't work in a .NET regex, because it does not support POSIX character classes. You may use \w (to match letters, digits, underscores) or [^\W_] (to match letters or digits).
You can use your regex and just keep the first 10 chars in the string:
new string(Regex.Replace(s, @"^msl_|\.\w{1,3}$","").Take(10).ToArray())
See the C# demo online:
var strings = new List<string> { "msl_0123456789_otherstuff.csv", "msl_test.xml", "anythingShort.w1" };
foreach (var s in strings)
{
Console.WriteLine("{0} => {1}", s, new string(Regex.Replace(s, @"^msl_|\.\w{1,3}$","").Take(10).ToArray()));
}
Output:
msl_0123456789_otherstuff.csv => 0123456789
msl_test.xml => test
anythingShort.w1 => anythingSh
Using Replace with the alternation removes either alternative from the start or the end of the string, but it also succeeds when the extension is absent, and it does not limit the number of characters kept in the middle.
If the file extension should be present you might use a capturing group and make msl_ optional at the beginning.
Then match 1 to 10 word characters other than _ in the group, followed by optional word characters up to the dot:
^(?:msl_)?([^\W_]{1,10})\w*\.[^\W_]{2,}$
.NET regex demo (Click on the table tab)
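A short sketch of how this first pattern might be used from C#. The FileNameMiddle class and the choice to return an empty string when nothing matches are assumptions for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FileNameMiddle
{
    // Optional msl_ prefix, then up to 10 letters/digits captured in group 1,
    // then any remaining word characters, a dot and an extension of 2+ chars.
    private static readonly Regex Middle =
        new Regex(@"^(?:msl_)?([^\W_]{1,10})\w*\.[^\W_]{2,}$");

    // Returns the captured middle part, or an empty string when the name
    // does not match (for example, when there is no extension).
    public static string Extract(string fileName)
    {
        Match m = Middle.Match(fileName);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        string[] names = { "msl_0123456789_otherstuff.csv", "msl_test.xml", "anythingShort.w1" };
        foreach (string name in names)
        {
            Console.WriteLine("{0} => {1}", name, Extract(name));
        }
    }
}
```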
A bit broader match could be using \S instead of \w and match until the last dot:
^(?:msl_)?(\S{1,10})\S*\.[^\W_]{2,}$
See another regex demo | C# demo
string[] strings = {"msl_0123456789_otherstuff.csv", "msl_test.xml","anythingShort.w1", "123456testxxxxxxxx"};
string pattern = @"^(?:msl_)?(\S{1,10})\S*\.[^\W_]{2,}$";
foreach (String s in strings) {
Match match = Regex.Match(s, pattern);
if (match.Success)
{
Console.WriteLine(match.Groups[1]);
}
}
Output
0123456789
test
anythingSh

Extract phone numbers and exclude extraneous characters

I'm trying to create a regex which will extract a complete phone number from a string (which is the only thing in the string) but leaving out any cruft like decorative brackets, etc.
The pattern I have mostly appears to work, but returns a list of matches - whereas I want it to return the phone number with the characters removed. Unfortunately, it completely fails if I add the start and end of line matchers...
^(?!\(\d+\)\s*){1}(?:[\+\d\s]*)$
Without the ^ and $ this matches the following numbers:
12345-678-901 returns three groups: 12345 678 901
+44-123-4567-8901 returns four groups: +44 123 4567 8901
(+48) 123 456 7890 returns four groups: +48 123 456 7890
How can I get the groups to be returned as a single, joined up whole?
Other than that, the only change I would like to include is to return nothing if there are any non-numeric, non-bracket, non-+ characters anywhere. So, this should fail:
(+48) 123 burger 7890
I'd keep it simple, makes it more readable and maintainable:
public string CleanPhoneNumber(string messynumber){
if(Regex.IsMatch(messynumber, "[a-z]"))
return "";
else
return Regex.Replace(messynumber, "[^0-9+]", "");
}
If any lowercase letters are present (extend this range if you wish), return blank; otherwise replace every character that is not 0-9 or + with nothing. This produces output like 0123456789 and +481234567, with all the brackets, spaces, hyphens etc. removed. If you want to keep those in the output, add them to the character class in the regex.
Side note: It's not immediately clear to me what you think is "cruft" that should be stripped (non a-z?) and what you think is "cruft" that should cause blank (a-z?). I struggled with this because you said (paraphrase) "non digit, non bracket, non plus should cause blank", but earlier in your examples your processing permitted numbers that had hyphens and also spaces. Read strictly, that spec would make hyphens and spaces "cruft that causes the whole thing to return blank" too.
I've assumed that it's lowercase chars from the "burger" example but as noted you can extend the range in the IF part should you need to include other chars that return blank
If you have a lot of them to do maybe pre compile a regex as a class level variable and use it in the method:
private static readonly Regex _strip = new Regex("[^0-9+]", RegexOptions.Compiled);
public string CleanPhoneNumber(string messynumber){
    if(Regex.IsMatch(messynumber, "[a-z]"))
        return "";
    else
        return _strip.Replace(messynumber, "");
}
...
for(int x = 0; x < millionStrArray.Length; x++)
    millionStrArray[x] = CleanPhoneNumber(millionStrArray[x]); // the method takes a single argument
I don't think you'll gain much from compiling the IsMatch one but you could try it in a similar pattern
Other options exist if you're avoiding regex: you could even do it using LINQ, or by looping over char arrays, StringBuilders etc. Regex is probably the easiest in terms of short, maintainable code.
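As a rough idea of what the LINQ route could look like, here is a sketch. The PhoneCleaner class is invented for the example, and it rejects any letter via char.IsLetter, which is a wider check than the [a-z] test above; treat that as an assumption to adjust.

```csharp
using System;
using System.Linq;

public static class PhoneCleaner
{
    public static string Clean(string messyNumber)
    {
        // Assumption: any letter at all (not just a-z) invalidates the input.
        if (messyNumber.Any(c => char.IsLetter(c)))
        {
            return "";
        }
        // Keep only ASCII digits and '+'; drop brackets, spaces, hyphens etc.
        var kept = messyNumber.Where(c => (c >= '0' && c <= '9') || c == '+');
        return new string(kept.ToArray());
    }

    public static void Main()
    {
        Console.WriteLine(Clean("(+48) 123 456 7890"));
        Console.WriteLine(Clean("(+48) 123 burger 7890").Length); // empty string
    }
}
```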
The strategy here is to use a look ahead and kick out (fail) a match if word characters are found.
Then when there are no characters, it then captures the + and all numbers into a match group named "Phone". We then extract that from the match's "Phone" capture group and combine as such:
string pattern = @"
^
(?=[\W\d+\s]+\Z) # Only allows Non Words, decimals and spaces; stop match if letters found
(?<Phone>\+?) # If a plus found at the beginning; allow it
( # Group begin
(?:\W*) # Match but don't *capture* any non numbers
(?<Phone>[\d]+) # Put the numbers in.
)+ # 1 to many numbers.
";
var number = "+44-123-33-8901";
var phoneNumber =
string.Join(string.Empty,
Regex.Match(number,
pattern,
RegexOptions.IgnorePatternWhitespace // Allows us to comment the pattern
).Groups["Phone"]
.Captures
.OfType<Capture>()
.Select(cp => cp.Value));
// phoneNumber is `+44123338901`
If one looks at the match structure, the data it houses is this:
Match #0
[0]: +44-123-33-8901
["1"] → [1]: -8901
→1 Captures: 44, -123, -33, -8901
["Phone"] → [2]: 8901
→2 Captures: +, 44, 123, 33, 8901
As you can see match[0] contains the whole match, but we only need the captures under the "Phone" group. With those captures { +, 44, 123, 33, 8901 } we now can bring them all back together by the string.Join.

Match only the nth occurrence using a regular expression

I have a string with 3 dates in it like this:
XXXXX_20160207_20180208_XXXXXXX_20190408T160742_xxxxx
I want to select the 2nd date in the string, the 20180208 one.
Is there a way to do this purely in the regex, without having to resort to pulling out the second match in code? I'm using C# if that matters.
Thanks for any help.
You could use
^(?:[^_]+_){2}(\d+)
And take the first group, see a demo on regex101.com.
Broken down, this says
^ # start of the string
(?:[^_]+_){2} # not _ + _, twice
(\d+) # capture digits
C# demo:
var pattern = @"^(?:[^_]+_){2}(\d+)";
var text = "XXXXX_20160207_20180208_XXXXXXX_20190408T160742_xxxxx";
var result = Regex.Match(text, pattern)?.Groups[1].Value;
Console.WriteLine(result); // => 20180208
Try this one
MatchCollection matches = Regex.Matches(sInputLine, @"\d{8}");
string sSecond = matches[1].ToString();
You could use the regular expression
^(?:.*?\d{8}_){1}.*?(\d{8})
to save the 2nd date to capture group 1.
Demo
Naturally, for n > 2, replace {1} with {n-1} to obtain the nth date. To obtain the 1st date use
^(?:.*?\d{8}_){0}.*?(\d{8})
Demo
The regex engine performs the following operations.
^ # match the beginning of a line
(?: # begin a non-capture group
.*? # match 0+ chars lazily
\d{8} # match 8 digits
_ # match '_'
) # end non-capture group
{n} # execute non-capture group n (n >= 0) times
.*? # match 0+ chars lazily
(\d{8}) # match 8 digits in capture group 1
The important thing to note is that the first instance of .*?, because it is lazy, consumes as few characters as possible: it stops as soon as what follows satisfies the rest of the group. With an underscore on both sides of \d{8}, a run of nine digits is skipped. For example, in the string
_1234abcd_efghi_123456789_12345678_ABC
capture group 1 in (.*?)_\d{8}_ will contain "_1234abcd_efghi_123456789".
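To make the {n-1} substitution concrete, the pattern can be built for any n at run time. This is only a sketch: the NthDate class, the Find method and returning null when there is no nth date are choices made for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NthDate
{
    // Builds ^(?:.*?\d{8}_){n-1}.*?(\d{8}) for the requested n (1-based)
    // and returns the captured date, or null when there is no nth date.
    public static string Find(string text, int n)
    {
        if (n < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(n));
        }
        string pattern = @"^(?:.*?\d{8}_){" + (n - 1) + @"}.*?(\d{8})";
        Match m = Regex.Match(text, pattern);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        string text = "XXXXX_20160207_20180208_XXXXXXX_20190408T160742_xxxxx";
        for (int n = 1; n <= 3; n++)
        {
            Console.WriteLine("{0}: {1}", n, Find(text, n));
        }
    }
}
```

Note that the skipped dates must each be followed by an underscore, while the captured one need not be, which is why the third date (followed by T) can still be returned.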
You can use System.Text.RegularExpressions.Regex
See the following example
Regex regex = new Regex(@"^(?:[^_]+_){2}(\d+)"); //Expression from Jan's answer just showing how to use C# to achieve your goal
GroupCollection groups = regex.Match("XXXXX_20160207_20180208_XXXXXXX_20190408T160742_xxxxx").Groups;
if (groups.Count > 1)
{
Console.WriteLine(groups[1].Value);
}

Regular expression matching a given structure

I need to generate a regex to match any string with this structure:
{"anyWord"}{"aSpace"}{"-"}{"anyLetter"}
How can I do it?
Thanks
EDIT
I have tried:
string txt="print -c";
string re1="((?:[a-z][a-z]+))"; // Word 1
Regex r = new Regex(re1,RegexOptions.IgnoreCase|RegexOptions.Singleline);
Match m = r.Match(txt);
if (m.Success)
{
String word1=m.Groups[1].ToString();
Console.Write("("+word1.ToString()+")"+"\n");
}
Console.ReadLine();
but this only matches the word "print"
This would be pretty straight-forward :
[a-zA-Z]+\s\-[a-zA-Z]
explained as follows :
[a-zA-Z]+ # Matches 1 or more letters
\s # Matches a single space
\- # Matches a single hyphen / dash
[a-zA-Z] # Matches a single letter
If you needed to implement this in C#, you could just use the Regex class and specifically the Regex.Matches() method:
var matches = Regex.Matches(yourString,@"[a-zA-Z]+\s\-[a-zA-Z]");
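If the whole string has to have this structure, rather than just contain it somewhere, one option is to anchor the expression. The sketch below assumes that is what is wanted; the ^ and $ anchors and the CommandPattern class are additions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommandPattern
{
    // Same pieces as above (word, one whitespace char, hyphen, one letter),
    // anchored so nothing may come before or after them.
    private static readonly Regex Whole = new Regex(@"^[a-zA-Z]+\s-[a-zA-Z]$");

    public static bool IsCommand(string s)
    {
        return Whole.IsMatch(s);
    }

    public static void Main()
    {
        Console.WriteLine(IsCommand("print -c"));  // word, space, hyphen, letter
        Console.WriteLine(IsCommand("print -cx")); // two letters after the hyphen
        Console.WriteLine(IsCommand("print-c"));   // no space before the hyphen
    }
}
```

Without the anchors, "print -cx" would still produce a match ("print -c"), which may or may not be what you need.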
