Extracting Titles from strings with RegEx


I need to extract program titles from short strings whose structure can't be fully predicted. There are some recurring patterns, shown below, and each string must be checked against them so that I can extract the title correctly.
I've bought Mastering Regular Expressions, but I don't have the time right now to study the book and get the necessary introduction to this (interesting but specialized) topic.
Perhaps someone experienced in this area could help me understand how to accomplish this?
Some random Name 2 - Ep.1
=> Some random Name 2
Some random Name - Ep.1
=> Some random Name
Boff another 2 name! - Ep. 228
=> Boff another 2 name!
Another one & the rest - T1 Ep. 2
=>Another one & the rest
T5 - Ep. 2 Another Name
=> Another Name
T3 - Ep. 3 - One More with an Hyfen
=> One More with an Hyfen
Another one this time with a Date - 02/12/2012
=>Another one this time with a Date
10 Aug 2012 - Some Other 2 - Ep. 2
=> Some Other 2
Ep. 93 - Some program name
=> Some Program name
Someother random name - Epis. 1 e 2
=> Someother random name
The Last one with something inside parenthesis (V.O.)
=> The Last one with something inside parenthesis
As you can see, the titles I want to extract may contain numbers, special characters like &, and the letters a-zA-Z (I guess that's all).
The complex part is handling the variations: there may be one or more spaces between the title and the hyphen, and zero or more spaces before Ep. (I can't explain it any better; it's just complex.)

This program will handle your cases. The main principle is that it removes a certain sequence if it is present at the beginning or the end of the string. You'll have to maintain the list of regular expressions if the format of the strings you want to remove changes, and reorder them as needed.
using System;
using System.Text.RegularExpressions;

public class MyClass
{
    // Sample inputs from the question.
    static string[] strs =
    {
        "Some random Name 2 - Ep.1",
        "Some random Name - Ep.1",
        "Boff another 2 name! - Ep. 228",
        "Another one & the rest - T1 Ep. 2",
        "T5 - Ep. 2 Another Name",
        "T3 - Ep. 3 - One More with an Hyfen",
        "Another one this time with a Date - 02/12/2012",
        "10 Aug 2012 - Some Other 2 - Ep. 2",
        "Ep. 93 - Some program name",
        "Someother random name - Epis. 1 e 2",
        "The Last one with something inside parenthesis (V.O.)"
    };

    // Sequences to strip, applied in this order. Some appear twice on purpose,
    // because removing one sequence can expose another at the edge of the string.
    static string[] regexes =
    {
        @"T\d+",
        @"\-",
        @"Ep(i(s(o(d(e)?)?)?)?)?\s*\.?\s*\d+(\s*e\s*\d+)*",
        @"\d{2}\/\d{2}\/\d{2,4}",
        @"\d{2}\s*[A-Z]{3}\s*\d{4}",
        @"T\d+",
        @"\-",
        @"\!",
        @"\(.+\)",
    };

    public static void Main()
    {
        foreach (var str in strs)
        {
            string cleaned = str.Trim();
            foreach (var cleaner in regexes)
            {
                // Remove the sequence from the start, then from the end, of the string.
                cleaned = Regex.Replace(cleaned, "^" + cleaner, string.Empty, RegexOptions.IgnoreCase).Trim();
                cleaned = Regex.Replace(cleaned, cleaner + "$", string.Empty, RegexOptions.IgnoreCase).Trim();
            }
            Console.WriteLine(cleaned);
        }
        Console.ReadKey();
    }
}

If it's only about checking for patterns, and not actually extracting the title name, let me have a go:
With @"Ep(is)?\.?\s*\d+" you can check for strings such as "Ep1", "Ep01", "Ep.999", "Ep3", "Epis.0", "Ep 11" and similar (it also allows any amount of whitespace between Ep and the number).
You may want to use RegexOptions.IgnoreCase if you want to match "ep1" as well as "Ep1" or "EP1".
If you are certain that no name will include a "-" and that this character separates the name from the episode info, you can try to split the string like this:
string[] splitString = inputString.Split(new char[] {'-'});
for (int i = 0; i < splitString.Length; i++)
{
    // Trim returns a new string, so the result has to be stored back
    splitString[i] = splitString[i].Trim(); // removes all leading or trailing whitespace
}
You'll have the name in either splitString[0] or splitString[1] and the episode-info in the other.
To search for dates, you can use this: @"\d{1,4}[\\/.,]\d{1,2}[\\/.,]\d{1,4}", which detects dates with the year at the front or the back, written with 1 to 4 digits (except for the middle value, which can be 1 to 2 digits long), and separated by a backslash, a slash, a comma or a dot.
Like I mentioned before: this will not allow your program to extract the actual title, only to find out whether such strings exist (and those strings may still be part of the title itself).
Edit:
A way to get rid of multiple whitespaces is to use inputString = Regex.Replace(inputString, @"\s+", " "), which replaces each run of whitespace with a single space. Maybe you have underscores instead of whitespace, such as "This_is_a_name"? In that case you might want to use inputString = Regex.Replace(inputString, "_+", " ") before collapsing the whitespace.
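The checks above could be wrapped up roughly like this. It is only a sketch: the names TitleChecks, HasEpisodeTag and NormalizeSpaces are made up for illustration, the final Trim() is an extra I added, and the episode pattern is the one from this answer, so it only detects an episode tag and does not extract the title.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TitleChecks
{
    // Pattern from the answer: "Ep" or "Epis", optional dot, optional whitespace, digits.
    private static readonly Regex EpisodeTag =
        new Regex(@"Ep(is)?\.?\s*\d+", RegexOptions.IgnoreCase);

    // True if the input contains something that looks like an episode tag.
    public static bool HasEpisodeTag(string input)
    {
        return EpisodeTag.IsMatch(input);
    }

    // Turns underscores into spaces, then collapses runs of whitespace into one space.
    public static string NormalizeSpaces(string input)
    {
        string withSpaces = Regex.Replace(input, "_+", " ");
        return Regex.Replace(withSpaces, @"\s+", " ").Trim(); // Trim is an assumption, not in the answer
    }
}
```

Keep in mind that the pattern is not anchored to a word start, so a title that happens to contain "ep" followed by a number would also be flagged.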

Related

REGEX Matching string nonconsecutively

I'm trying to understand how to match a specific string that's held within an array (This string will always be 3 characters long, ex: 123, 568, 458 etc) and I would match that string to a longer string of characters that could be in any order (9841273 for example). Is it possible to check that at least 2 of the 3 characters in the string match (in this example) strMoves? Please see my code below for clarification.
private readonly string[] strSolutions = new string[8] { "123", "159", "147", "258", "357", "369", "456", "789" };
private static string strMoves = "1823742";
foreach (string strResult in strSolutions)
{
Regex rgxMain = new Regex("[" + strMoves + "]{2}");
if (rgxMain.IsMatch(strResult))
{
MessageBox.Show(strResult);
}
}
The portion where I have designated "{2}" in Regex is where I expected the result to check for at least 2 matching characters, but my logic is definitely flawed. It will return true IF the two characters are in consecutive order as compared to the string in strResult. If it's not in the correct order it will return false. I'm going to continue to research on this but if anyone has ideas on where to look in Microsoft's documentation, that would be greatly appreciated!
Correct order where it would return true: "144257" when matched to "123"
incorrect order: "35718" when matched to "123"
The 3 is before the 1, so it won't match.

You can use the following solution if you need to find at least two different not necessarily consecutive chars from a specified set in a longer string:
new Regex($@"([{strMoves}]).*(?!\1)[{strMoves}]", RegexOptions.Singleline)
It will look like
([1823742]).*(?!\1)[1823742]
See the regex demo.
Pattern details:
([1823742]) - Capturing group 1: one of the chars in the character class
.* - any zero or more chars as many as possible (due to RegexOptions.Singleline, . matches any char including newline chars)
(?!\1) - a negative lookahead that fails the match if the next char is a starting point of the value stored in the Group 1 memory buffer (since it is a single char here, the next char should not equal the text in Group 1, one of the specified digits)
[1823742] - one of the chars in the character class.
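Applied to the data from the question, the pattern could be used roughly as follows. This is a sketch under assumptions: the class and method names are hypothetical, and moves is assumed to contain only digits, so it can be dropped into a character class without escaping.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MoveMatcher
{
    // Returns true if at least two different characters of `solution`
    // appear in `moves`, in any order and not necessarily adjacent.
    // Assumption: `moves` holds only digits, so no escaping is needed inside [...].
    public static bool HasTwoDistinctMoves(string solution, string moves)
    {
        var rgx = new Regex($@"([{moves}]).*(?!\1)[{moves}]", RegexOptions.Singleline);
        return rgx.IsMatch(solution);
    }
}
```

With strMoves = "1823742", a call such as MoveMatcher.HasTwoDistinctMoves("123", strMoves) returns true, while "159" returns false because only the 1 has been played.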

Regex.Replace using regular expression as replacement

I am new to C# programming language and came across the following problem
I have a string " avenue 4 TH some more words". I want to remove space between 4 and TH. I have written a regex which helps in determining whether "4 TH" is available in a string or not.
[0-9]+\s(th|nd|st|rd)
string result = "avenue 4 TH some more words";
var match = Regex.IsMatch(result,"\\b" + item + "\\b",RegexOptions.IgnoreCase) ;
Console.WriteLine(match);//True
Is there anything in C# which will remove the space
something like Regex.Replace(result, "[0-9]+\\s(th|nd|st|rd)", "[0-9]+(th|nd|st|rd)",RegexOptions.IgnoreCase);
so that end result looks like
avenue 4TH some more words

You may use
var pattern = @"(?i)(\d+)\s*(th|[nr]d|st)\b";
var match = string.Concat(Regex.Match(result, pattern)?.Groups.Cast<Group>().Skip(1));
See the C# demo yielding 4TH.
The regex - (?i)(\d+)\s*(th|[nr]d|st)\b - matches 1 or more digits capturing the value into Group 1, then 0 or more whitespaces are matched with \s*, and then th, nd, rd or st as whole words (as \b is a word boundary) are captured into Group 2.
The Regex.Match(result, pattern)? part tries to match the pattern in the string. If there is a match, the match object's Groups property is accessed and all groups are cast to a Group list with Groups.Cast<Group>(). Since the first group is the whole match value, we .Skip(1) it.
The rest - the values of Group 1 and Group 2 - are concatenated with string.Concat.
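If the goal is the whole sentence with the space removed (rather than only the 4TH fragment), the same pattern can probably be fed to Regex.Replace with the two groups as the replacement. A minimal sketch, with a made-up class and method name:

```csharp
using System;
using System.Text.RegularExpressions;

public static class OrdinalFixer
{
    // Rewrites "4 TH" as "4TH" wherever digits are followed by an ordinal suffix.
    // $1 is the number, $2 is the suffix; the whitespace between them is dropped.
    public static string JoinOrdinals(string input)
    {
        return Regex.Replace(input, @"(?i)(\d+)\s*(th|[nr]d|st)\b", "$1$2");
    }
}
```

For example, OrdinalFixer.JoinOrdinals("avenue 4 TH some more words") gives "avenue 4TH some more words"; the rest of the string is left untouched.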

Extract phone numbers and exclude extraneous characters

I'm trying to create a regex which will extract a complete phone number from a string (which is the only thing in the string) but leaving out any cruft like decorative brackets, etc.
The pattern I have mostly appears to work, but returns a list of matches - whereas I want it to return the phone number with the characters removed. Unfortunately, it completely fails if I add the start and end of line matchers...
^(?!\(\d+\)\s*){1}(?:[\+\d\s]*)$
Without the ^ and $ this matches the following numbers:
12345-678-901 returns three groups: 12345 678 901
+44-123-4567-8901 returns four groups: +44 123 4567 8901
(+48) 123 456 7890 returns four groups: +48 123 456 7890
How can I get the groups to be returned as a single, joined up whole?
Other than that, the only change I would like to include is to return nothing if there are any non-numeric, non-bracket, non-+ characters anywhere. So, this should fail:
(+48) 123 burger 7890

I'd keep it simple, makes it more readable and maintainable:
public string CleanPhoneNumber(string messynumber){
    if(Regex.IsMatch(messynumber, "[a-z]"))
        return "";
    else
        return Regex.Replace(messynumber, "[^0-9+]", "");
}
If any alphabetic characters are present (extend this range if you wish), return blank; otherwise replace every char that is not 0-9 or + with nothing. This produces output like 0123456789 and +481234567, with all the brackets, spaces, hyphens etc. removed too. If you want to keep those in the output, add them to the regex.
Side note: It's not immediately clear to me what you consider "cruft" that should be stripped (non a-z?) and what you consider "cruft" that should cause a blank result (a-z?). I struggled with this because you said (paraphrasing) "non-digit, non-bracket, non-plus should cause blank", but earlier your examples permitted numbers containing hyphens and spaces. Applying your spec strictly, hyphens and spaces would be "cruft that causes the whole thing to return blank" too.
I've assumed that it's lowercase chars, going by the "burger" example, but as noted you can extend the range in the if part should other chars need to return blank.
If you have a lot of them to do, maybe precompile a regex as a class-level variable and use it in the method:
private Regex _strip = new Regex("[^0-9+]", RegexOptions.Compiled);
public string CleanPhoneNumber(string messynumber){
    if(Regex.IsMatch(messynumber, "[a-z]"))
        return "";
    else
        return _strip.Replace(messynumber, "");
}
...
for(int x = 0; x < millionStrArray.Length; x++)
    millionStrArray[x] = CleanPhoneNumber(millionStrArray[x]);
I don't think you'll gain much from compiling the IsMatch one, but you could try it in a similar pattern.
Other options exist if you're avoiding regex: you could even do it using LINQ, or by looping on char arrays, StringBuilders etc. Regex is probably the easiest in terms of short, maintainable code.
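For comparison, a LINQ version might look something like the sketch below. The class and method names are hypothetical, and it deliberately differs from the regex version in one respect: it treats any letter (not just lowercase a-z) as a reason to return blank, which is my assumption about what is wanted.

```csharp
using System;
using System.Linq;

public static class PhoneCleaner
{
    public static string CleanPhoneNumberLinq(string messyNumber)
    {
        // Assumption: any letter at all (not only a-z) invalidates the input.
        if (messyNumber.Any(c => char.IsLetter(c)))
            return "";

        // Keep only ASCII digits and '+'; brackets, spaces and hyphens are dropped.
        var kept = messyNumber.Where(c => (c >= '0' && c <= '9') || c == '+');
        return new string(kept.ToArray());
    }
}
```

It is a little longer than the regex one-liner, which rather supports the point that regex is the shorter option here.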

The strategy here is to use a look ahead and kick out (fail) a match if word characters are found.
Then when there are no characters, it then captures the + and all numbers into a match group named "Phone". We then extract that from the match's "Phone" capture group and combine as such:
string pattern = @"
^
(?=[\W\d+\s]+\Z) # Only allows Non Words, decimals and spaces; stop match if letters found
(?<Phone>\+?) # If a plus found at the beginning; allow it
( # Group begin
(?:\W*) # Match but don't *capture* any non numbers
(?<Phone>[\d]+) # Put the numbers in.
)+ # 1 to many numbers.
";
var number = "+44-123-33-8901";
var phoneNumber =
string.Join(string.Empty,
Regex.Match(number,
pattern,
RegexOptions.IgnorePatternWhitespace // Allows us to comment the pattern
).Groups["Phone"]
.Captures
.OfType<Capture>()
.Select(cp => cp.Value));
// phoneNumber is `+44123338901`
If one looks at the match structure, the data it houses is this:
Match #0
[0]: +44-123-33-8901
["1"] → [1]: -8901
→1 Captures: 44, -123, -33, -8901
["Phone"] → [2]: 8901
→2 Captures: +, 44, 123, 33, 8901
As you can see match[0] contains the whole match, but we only need the captures under the "Phone" group. With those captures { +, 44, 123, 33, 8901 } we now can bring them all back together by the string.Join.

using Regex to iterate over a string and search for 3 consecutive hyphens and replace it with [space][hyphen][space]

I currently have a string which looks like this when it is returned :
//This is the url string
// the-great-debate---toilet-paper-over-or-under-the-roll
string name = string.Format("{0}",url);
name = Regex.Replace(name, "-", " ");
And when I perform the following Regex operation it becomes like this :
the great debate toilet paper over or under the roll
However, like I mentioned in the question, I want to be able to apply regex to the url string so that I have the following output:-
the great debate - toilet paper over or under the roll
I would really appreciate any assistance.
[EDIT] However, not all the strings look like this; some of them have just single hyphens, so the above method works:
world-water-day-2016
and it changes to
world water day 2016
but for this one:
the-great-debate---toilet-paper-over-or-under-the-roll
I need a way to check if the string has 3 hyphens, then replace those 3 hyphens with [space][hyphen][space], and then replace all the remaining single hyphens between the words with a space.

First of all, there is always a very naive solution to this kind of problem: you replace your specific matches in context with some chars that are not usually used in the current environment and after replacing generic substrings you may replace the temporary substrings with the necessary exception.
var name = url.Replace("---", "[ \uFFFD ]").Replace("-", " ").Replace("[ \uFFFD ]", " - ");
You may also use a regex based replacement that matches either a 3-hyphen substring capturing it, or just match a single hyphen, and then check if Group 1 matched inside a match evaluator (the third parameter to Regex.Replace can be a Match evaluator method).
It will look like
var name = Regex.Replace(url, @"(---)|-", m => m.Groups[1].Success ? " - " : " ");
See the C# demo.
So, when (---) part matches, the 3 hyphens are put into Group 1 and the .Success property is set to true. Thus, m => m.Groups[1].Success ? " - " : " " replaces 3 hyphens with space+-+space and 1 hyphen (that may be actually 1 of the 2 consecutive hyphens) with a space.

Here's a solution using LINQ rather than Regex:
var str = "the-great-debate---toilet-paper-over-or-under-the-roll";
var result = str.Split(new string[] {"---"}, StringSplitOptions.None)
.Select(s => s.Replace("-", " "))
.Aggregate((c,n) => $"{c} - {n}");
// result = "the great debate - toilet paper over or under the roll"
Split the string up based on the ---, then remove hyphens from each substring, then join them back together.

The easy way:
name = Regex.Replace(name, @"\b-|-\b", " ");
The show-off way:
name = Regex.Replace(name, @"(\b)?-(?(1)|\b)", " ");
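To spell out why the easy way works: a hyphen is replaced only when it touches a word character on at least one side, so in a run of three hyphens the outer two become spaces and the middle one survives. The pattern needs to be a verbatim string (@"...") so that \b reaches the regex engine as a word boundary rather than as a backspace character. A small sketch, with a made-up class and method name:

```csharp
using System;
using System.Text.RegularExpressions;

public static class SlugFormatter
{
    // Replaces every hyphen that is adjacent to a word character with a space.
    // In "debate---toilet" the first and third hyphens qualify, the middle one does not.
    public static string ToTitle(string slug)
    {
        return Regex.Replace(slug, @"\b-|-\b", " ");
    }
}
```

SlugFormatter.ToTitle("world-water-day-2016") gives "world water day 2016", and the three-hyphen URL from the question comes out with " - " in the middle.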

Regex expression to replace colon grouped integers into years, months, and days and drop leading zeroes

I'm pretty bad at Regex (C#), and my attempts at doing the following give nonsensical results.
Given string: 058:09:07
where only the last two digits are guaranteed, I need the result of:
"58y 9m 7d"
The needed rules are:
The last two digits "07" are days group and always present. If "00", then only the last "0" is to be printed,
The group immediately to the left of "07", which ends with ":", signifies the months and is only present if there are enough days to make up months. Again, if "00", then only the last "0" is to be printed,
The group immediately to the left of "09:", which ends with ":", signifies the years and will only be present if more than 12 months are needed.
In each group a leading "0" will be dropped.
(This is the result of an age calculation where 058:09:07 means 58 years, 9 months, and 7 days old. The ":" (colon) always used to separate years from months from days).
Example:
058:09:07 --> 58y 9m 7d
01:00 --> 1m 0d
08:00:00 --> 8y 0m 0d
00 --> 0d
Any help is most appreciated.

Well, you can pretty much do this without regex.
var str = "058:09:07";
var integers = str.Split(':').Select(int.Parse).ToArray();
var result = "";
switch(integers.Length)
{
case 1:
result = string.Format("{0}d", integers[0]); break;
case 2:
result = string.Format("{0}m {1}d", integers[0], integers[1]); break;
case 3:
result = string.Format("{0}y {1}m {2}d", integers[0], integers[1], integers[2]); break;
}
If you want to use regex so bad, that it starts to hurt, you can use this one instead:
var integers = Regex.Matches(str, @"\d+").Cast<Match>().Select(x=> int.Parse(x.Value)).ToArray();
But it's overhead, of course. You see, regex is not a parsing language, it's a pattern matching language, and should be used as one, for example for finding substrings in strings. If you can get the final substrings simply by splitting on a char, why not do that?
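The split-based idea can also be written without the switch, by lining the parts up against the unit labels from the right. This is just a sketch: AgeFormatter and Format are hypothetical names, and the input is assumed to be one to three colon-separated integer groups.

```csharp
using System;
using System.Linq;

public static class AgeFormatter
{
    // Unit labels, ordered from the leftmost possible group to the rightmost.
    private static readonly string[] Units = { "y", "m", "d" };

    public static string Format(string input)
    {
        // int.Parse drops leading zeros: "058" -> 58, "00" -> 0.
        int[] parts = input.Split(':').Select(s => int.Parse(s)).ToArray();
        if (parts.Length > Units.Length)
            throw new FormatException("Expected at most three colon-separated groups.");

        // The last group is always days, so align the parts to the end of Units.
        int offset = Units.Length - parts.Length;
        return string.Join(" ", parts.Select((value, i) => value + Units[offset + i]));
    }
}
```

AgeFormatter.Format("058:09:07") gives "58y 9m 7d" and AgeFormatter.Format("01:00") gives "1m 0d", matching the examples in the question.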

DISCLAIMER: I am posting this answer for educational purposes. If the whole string represents the time span, eocron06's answer is the easiest and most correct way to go.
The point here is that you have optional parts that go in a specific order. To match them all correctly you may use the following regex:
\b(?:(?:0*(?<h>\d+):)?0*(?<m>\d+):)?0*(?<d>\d+)\b
See the regex demo
Details:
\b - initial word boundary
(?: - start of a non-capturing optional group (see the ? at the end below)
(?:0*(?<h>\d+):)? - a nested non-capturing optional group that matches zero or more zeros (to trim leading zeros from this part), then captures 1+ digits into Group "h" (the years, in terms of the question) and matches a :
0*(?<m>\d+): - again, matches zero or more 0s, then captures one or more digits into Group "m" and matches a :
)? - end of the first optional group
0*(?<d>\d+) - same as the first two above, but captures 1+ digits (days) into Group "d"
\b - trailing word boundary
See the C# demo where the final string is built upon analyzing which group is matched:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

public class Test
{
    public static void Main()
    {
        // Group "h" holds the leftmost (years) part, "m" the months, "d" the days.
        var pattern = @"\b(?:(?:0*(?<h>\d+):)?0*(?<m>\d+):)?0*(?<d>\d+)\b";
        var strs = new List<string>() {"07", "09:07", "058:09:07" };
        foreach (var s in strs)
        {
            // Build the output from whichever optional groups took part in the match.
            var result = Regex.Replace(s, pattern, m =>
                m.Groups["h"].Success && m.Groups["m"].Success ?
                    string.Format("{0}y {1}m {2}d", m.Groups["h"].Value, m.Groups["m"].Value, m.Groups["d"].Value) :
                m.Groups["m"].Success ?
                    string.Format("{0}m {1}d", m.Groups["m"].Value, m.Groups["d"].Value) :
                    string.Format("{0}d", m.Groups["d"].Value)
            );
            Console.WriteLine(result);
        }
    }
}
