I have tried a couple of regular expressions but have not been able to come up with one that works correctly. I have a string with multiple lines, and I want to keep only the lines that contain numbers.
Current String
-----------------
Dog
Cat
Cat 1
Dog 22
Once processed the expected result is:
Filtered String
-----------------
Cat 1
Dog 22
myString.Split('\n').Where(s => s.Any(c => Char.IsDigit(c)));
This splits the string by newline ('\n') characters, and for each "line", it finds the ones that have at least one character that is a digit.
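Since the question asked about regular expressions, a regex-based variant of the same filter might look like the sketch below. The names `DigitLineFilter` and `KeepLinesWithDigits` are made up for illustration, and the sketch assumes lines are separated by `\n`, optionally preceded by `\r`.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class DigitLineFilter
{
    // Hypothetical helper: keeps only the lines that contain at least one digit.
    public static string KeepLinesWithDigits(string text)
    {
        var lines = text.Split('\n')
                        .Select(line => line.TrimEnd('\r'))          // tolerate Windows line endings
                        .Where(line => Regex.IsMatch(line, @"\d"));  // at least one digit on the line
        return string.Join("\n", lines);
    }

    public static void Main()
    {
        string input = "Dog\nCat\nCat 1\nDog 22";
        Console.WriteLine(KeepLinesWithDigits(input));
    }
}
```

For the sample input this prints the two lines `Cat 1` and `Dog 22`.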
Related
I'm trying to create a regex which will extract a complete phone number from a string (which is the only thing in the string) but leaving out any cruft like decorative brackets, etc.
The pattern I have mostly appears to work, but returns a list of matches - whereas I want it to return the phone number with the characters removed. Unfortunately, it completely fails if I add the start and end of line matchers...
^(?!\(\d+\)\s*){1}(?:[\+\d\s]*)$
Without the ^ and $ this matches the following numbers:
12345-678-901 returns three groups: 12345 678 901
+44-123-4567-8901 returns four groups: +44 123 4567 8901
(+48) 123 456 7890 returns four groups: +48 123 456 7890
How can I get the groups to be returned as a single, joined up whole?
Other than that, the only change I would like to include is to return nothing if there are any non-numeric, non-bracket, non-+ characters anywhere. So, this should fail:
(+48) 123 burger 7890
I'd keep it simple, makes it more readable and maintainable:
public string CleanPhoneNumber(string messynumber){
if(Regex.IsMatch(messynumber, "[a-z]"))
return "";
else
return Regex.Replace(messynumber, "[^0-9+]", "");
}
If any lowercase alphabetic characters are present (extend this range if you wish), return blank; otherwise replace every character that is not 0-9 or + with nothing. This produces output like 0123456789 and +481234567, with all the brackets, spaces, hyphens, etc. removed too. If you want to keep those in the output, add them to the regex.
Side note: it's not immediately clear to me what you consider "cruft" that should be stripped (non a-z?) and what you consider "cruft" that should cause a blank result (a-z?). I struggled with this because you said (paraphrasing) "non-digit, non-bracket, non-plus should cause blank", but earlier in your examples your processing permitted numbers that had hyphens and spaces. Under a strict reading of that spec, hyphens and spaces would also be "cruft that causes the whole thing to return blank".
I've assumed that it's lowercase characters, based on the "burger" example, but as noted you can extend the range in the IF part should you need to include other characters that return blank.
If you have a lot of them to process, consider precompiling a regex as a class-level variable and using it in the method:
private Regex _strip = new Regex( "[^0-9+]", RegexOptions.Compiled);
public string CleanPhoneNumber(string messynumber){
if(Regex.IsMatch(messynumber, "[a-z]"))
return "";
else
return _strip.Replace(messynumber, "");
}
...
for(int x = 0; x < millionStrArray.Length; x++)
    millionStrArray[x] = CleanPhoneNumber(millionStrArray[x]);
I don't think you'll gain much from compiling the IsMatch one but you could try it in a similar pattern
Other options exist if you're avoiding regex: you could even do it using LINQ, or by looping over char arrays, StringBuilders, etc. Regex is probably the easiest in terms of short, maintainable code.
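As a rough illustration of the LINQ option, the sketch below mirrors the two regex rules from `CleanPhoneNumber` without using `Regex`. The class name `PhoneCleanerLinq` is hypothetical, and the sketch keeps the same assumption as the answer: only lowercase ASCII letters cause a blank result.

```csharp
using System;
using System.Linq;

public static class PhoneCleanerLinq
{
    // Hypothetical LINQ-only variant of CleanPhoneNumber.
    public static string CleanPhoneNumber(string messyNumber)
    {
        // Any lowercase ASCII letter rejects the whole input (same rule as "[a-z]").
        if (messyNumber.Any(c => c >= 'a' && c <= 'z'))
            return "";

        // Keep only ASCII digits and '+', mirroring the "[^0-9+]" replacement.
        return new string(messyNumber
            .Where(c => (c >= '0' && c <= '9') || c == '+')
            .ToArray());
    }

    public static void Main()
    {
        Console.WriteLine(CleanPhoneNumber("(+48) 123 456 7890"));           // +481234567890
        Console.WriteLine(CleanPhoneNumber("(+48) 123 burger 7890").Length); // 0
    }
}
```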
The strategy here is to use a look ahead and kick out (fail) a match if word characters are found.
Then when there are no characters, it then captures the + and all numbers into a match group named "Phone". We then extract that from the match's "Phone" capture group and combine as such:
string pattern = @"
^
(?=[\W\d+\s]+\Z) # Only allows Non Words, decimals and spaces; stop match if letters found
(?<Phone>\+?) # If a plus found at the beginning; allow it
( # Group begin
(?:\W*) # Match but don't *capture* any non numbers
(?<Phone>[\d]+) # Put the numbers in.
)+ # 1 to many numbers.
";
var number = "+44-123-33-8901";
var phoneNumber =
string.Join(string.Empty,
Regex.Match(number,
pattern,
RegexOptions.IgnorePatternWhitespace // Allows us to comment the pattern
).Groups["Phone"]
.Captures
.OfType<Capture>()
.Select(cp => cp.Value));
// phoneNumber is `+44123338901`
If one looks at the match structure, the data it houses is this:
Match #0
[0]: +44-123-33-8901
["1"] → [1]: -8901
→1 Captures: 44, -123, -33, -8901
["Phone"] → [2]: 8901
→2 Captures: +, 44, 123, 33, 8901
As you can see match[0] contains the whole match, but we only need the captures under the "Phone" group. With those captures { +, 44, 123, 33, 8901 } we now can bring them all back together by the string.Join.
I want to check if an input string follows a pattern and if it does extract information from it.
My pattern looks like this: Episode 000 (Season 00), where each 0 stands for a digit from 0 to 9. Now I want to check whether the input Episode 094 (Season 02) matches this pattern and, because it does, extract those two numbers so that I end up with two integer variables, 94 and 2:
string latestFile = "Episode 094 (Season 02)";
if (!Regex.IsMatch(latestFile, @"^(Episode)\s[0-9][0-9][0-9]\s\((Season)\s[0-9][0-9]\)$"))
    return;
int Episode = Int32.Parse(Regex.Match(latestFile, @"\d+").Value);
int Season = Int32.Parse(Regex.Match(latestFile, @"\d+").Value);
The first part, where I check whether the overall string matches the pattern, works, but I think it can be improved. For the second part, where I actually extract the numbers, I'm stuck: what I posted above obviously doesn't work, because it grabs digits from anywhere in the string. So if any of you could help me figure out how to extract only the three digits after Episode and the two digits after Season, that would be great.
^Episode (\d{1,3}) \(Season (\d{1,2})\)$
This captures the two numbers (allowing 1 to 3 digits for the episode and 1 to 2 digits for the season) and returns them as groups.
You can go even further and name your groups:
^Episode (?<episode>\d{1,3}) \(Season (?<season>\d{1,2})\)$
and then refer to them by name.
Example for using groups:
string pattern = @"abc(?<firstGroup>\d{1,3})abc";
string input = "abc234abc";
Regex rgx = new Regex(pattern);
Match match = rgx.Match(input);
string result = match.Groups["firstGroup"].Value; //=> 234
You can see what the expressions mean and test them here
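Applied to the episode string from the question, the named-group pattern could be used roughly as follows. This is only a sketch: `EpisodeParser` and `TryParse` are hypothetical names, and the pattern is the named-group expression quoted earlier in this answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EpisodeParser
{
    // Hypothetical helper: returns false when the input does not match the pattern.
    public static bool TryParse(string input, out int episode, out int season)
    {
        episode = 0;
        season = 0;

        Match match = Regex.Match(
            input,
            @"^Episode (?<episode>\d{1,3}) \(Season (?<season>\d{1,2})\)$");
        if (!match.Success)
            return false;

        // Leading zeros are dropped by the integer conversion ("094" becomes 94).
        episode = int.Parse(match.Groups["episode"].Value);
        season = int.Parse(match.Groups["season"].Value);
        return true;
    }

    public static void Main()
    {
        if (TryParse("Episode 094 (Season 02)", out int episode, out int season))
            Console.WriteLine($"{episode} {season}"); // 94 2
    }
}
```

Doing the match once and reading both groups from it avoids the separate `IsMatch` check followed by two further `Match` calls.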
In your regex ^(Episode)\s[0-9][0-9][0-9]\s\((Season)\s[0-9][0-9]\)$ you are capturing Episode and Season in a capturing group, but what you actually want to capture is the digits. You could switch your capturing groups like this:
^Episode\s([0-9][0-9][0-9])\s\(Season\s([0-9][0-9])\)$
Matching 3 digits in this way [0-9][0-9][0-9] can be written as \d{3} and [0-9][0-9] as \d{2}.
That would look like ^Episode\s(\d{3})\s\(Season\s(\d{2})\)$
To match one or more digits you could use \d+.
\s matches a whitespace character. You could use either \s or a literal space.
Your regex could look like:
^Episode (\d{3}) \(Season (\d{2})\)$
string latestFile = "Episode 094 (Season 02)";
GroupCollection groups = Regex.Match(latestFile, @"^Episode (\d{3}) \(Season (\d{2})\)$").Groups;
int Episode = Int32.Parse(groups[1].Value);
int Season = Int32.Parse(groups[2].Value);
Console.WriteLine(Episode);
Console.WriteLine(Season);
That would result in:
94
2
I have strings of the form:
5 dogs = 1 medium size house
4 cats = 2 small houses
one bird = 1 bird cage
What I am trying to do is remove the substring that exists before the equals sign, but only if the substring contains a keyword and the data before that keyword is an integer.
So in this example my key words are:
dogs,
cats,
bird
In the above example, the ideal output of my process would be:
1 medium size house
2 small houses
one bird = 1 bird cage
My code so far looks like this (I am hard coding the keyword values/strings for now)
var originalstring = "5 dogs = 1 medium size house";
int equalsindex = originalstring.IndexOf('=');
var prefix = originalstring.Substring(0, equalsindex);
if (prefix.Contains("dogs"))
{
    // Drop the prefix, then the equals sign
    var modifiedstring = originalstring.Remove(0, prefix.Length).Replace("=", string.Empty);
    return modifiedstring;
}
return originalstring;
The issue here is that I am removing the whole substring regardless of whether or not the data preceding the keyword is a number.
Would somebody be able to help me with this additional logic?
Thanks so much as always for anybody who takes a few minutes to read this question.
Mick
You can do it with a simple regex of the form
\d+\s+(?:kw1|kw2|kw3|...)\s*=\s*
where kwX is the corresponding keyword.
var data = new[] {
"5 dogs = 1 medium size house",
"4 cats = 2 small houses",
"one bird = 1 bird cage"
};
var keywords = new[] {"dogs", "cats", "bird"};
var regexStr = string.Format(@"\d+\s+(?:{0})\s*=\s*", string.Join("|", keywords));
var regex = new Regex(regexStr);
foreach (var s in data) {
Console.WriteLine("'{0}'", regex.Replace(s, string.Empty));
}
In the example above the call of string.Format pastes the list of keywords joined by | into the "template" of the expression at the top of the post, i.e.
\d+\s+(?:dogs|cats|bird)\s*=\s*
This expression matches
One or more digits \d+, followed by
One or more whitespace characters \s+, followed by
A keyword from the list: dogs, cats, bird (?:dogs|cats|bird), followed by
Zero or more spaces \s*, followed by
An equal sign =, followed by
Zero or more spaces \s*
The rest is easy: since this regex matches the part that you wish to remove, you need to call Replace and pass it string.Empty.
You can use regex (System.Text.RegularExpressions) to identify whether or not there is a number in the string.
Regex r = new Regex("[0-9]"); //Look for a number between 0 and 9
bool hasNumber = r.IsMatch(prefix);
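This regex simply searches for any digit in the string. If you want to search for a digit, then a space, then a letter, you could use [0-9] [a-zA-Z]. The character class [a-zA-Z] covers both upper- and lowercase letters; writing it as [0-9] [a-z]|[A-Z] would not work as intended, because | splits the whole pattern into two alternatives.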
You can try something like this:
int i;
if(int.TryParse(prefix.Substring(0, 1), out i)) //try to get an int from first char of prefix
{
//remove prefix
}
This will only work for single-digit integers, however.
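One way to lift the single-digit limitation, sketched under the assumption that the number is the first space-separated token of the prefix, is to parse that whole token instead of only the first character. `PrefixCheck` and `StartsWithInteger` are hypothetical names.

```csharp
using System;

public static class PrefixCheck
{
    // Hypothetical helper: true when the first space-separated token of the prefix is an integer.
    public static bool StartsWithInteger(string prefix)
    {
        string[] tokens = prefix.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        return tokens.Length > 0 && int.TryParse(tokens[0], out _);
    }

    public static void Main()
    {
        Console.WriteLine(StartsWithInteger("5 dogs "));   // True
        Console.WriteLine(StartsWithInteger("12 cats "));  // True
        Console.WriteLine(StartsWithInteger("one bird ")); // False
    }
}
```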
I've seen a few answers that are similar, but none seem to go far enough. I need to split the string wherever the letters change to numbers and back. The trick is that the pattern is variable, meaning there can be any number of letter or number groupings.
For Example
AB1000 => AB 1000
ABC1500 => ABC 1500
DE160V1 => DE 160 V 1
FGG217H5IJ1 => FGG 217 H 5 IJ 1
Etc.
If you want to split the string, one way would be lookarounds:
string[] results = Regex.Split("FGG217H5IJ1", @"(?<=\d)(?=\D)|(?<=\D)(?=\d)");
Console.WriteLine(String.Join(" ", results)); //=> "FGG 217 H 5 IJ 1"
You can use a regex like this:
[A-Z]+|\d+
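With this pattern the idea is to collect matches rather than split. A minimal sketch, assuming the input contains only uppercase letters and digits as in the examples (the names `AlphaNumSplitter` and `Tokenize` are made up):

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class AlphaNumSplitter
{
    // Hypothetical helper: collects each run of uppercase letters or digits as one token.
    public static string[] Tokenize(string input)
    {
        return Regex.Matches(input, @"[A-Z]+|\d+")
                    .Cast<Match>()          // MatchCollection is non-generic on older frameworks
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(" ", Tokenize("DE160V1")));     // DE 160 V 1
        Console.WriteLine(string.Join(" ", Tokenize("FGG217H5IJ1"))); // FGG 217 H 5 IJ 1
    }
}
```

Unlike the lookaround split, any character that is neither an uppercase letter nor a digit is silently skipped here.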
I'm facing a problem: I have to extract the titles of programs from short strings whose structure can't be predicted at all. There are some patterns, as you can see below, and each string must be checked against those structures so that I can extract the title properly.
I've bought Mastering Regular Expressions, but the time I have to accomplish this doesn't allow me to study the book and get the necessary introduction to this (interesting but particular) topic.
Perhaps someone experienced in this area could help me understand how to accomplish this?
Some random Name 2 - Ep.1
=> Some random Name 2
Some random Name - Ep.1
=> Some random Name
Boff another 2 name! - Ep. 228
=> Boff another 2 name!
Another one & the rest - T1 Ep. 2
=>Another one & the rest
T5 - Ep. 2 Another Name
=> Another Name
T3 - Ep. 3 - One More with an Hyfen
=> One More with an Hyfen
Another one this time with a Date - 02/12/2012
=>Another one this time with a Date
10 Aug 2012 - Some Other 2 - Ep. 2
=> Some Other 2
Ep. 93 - Some program name
=> Some Program name
Someother random name - Epis. 1 e 2
=> Someother random name
The Last one with something inside parenthesis (V.O.)
=> The Last one with something inside parenthesis
As you can see, the titles that I want to extract from the given strings may contain numbers, special characters like &, and characters from a-zA-Z (I guess that's all).
The complex part is determining whether the title is followed by one or more spaces and then a hyphen, and whether there are zero or more spaces before Ep. (I can't explain this well; it's just complex.)
This program will handle your cases. The main principle is that it removes certain sequences if they are present at the beginning or the end of the string. You'll have to maintain the list of regular expressions if the format of the strings you want to remove changes, or reorder them as needed.
using System;
using System.Text.RegularExpressions;
public class MyClass
{
    static string[] strs =
    {
        "Some random Name 2 - Ep.1",
        "Some random Name - Ep.1",
        "Boff another 2 name! - Ep. 228",
        "Another one & the rest - T1 Ep. 2",
        "T5 - Ep. 2 Another Name",
        "T3 - Ep. 3 - One More with an Hyfen",
        @"Another one this time with a Date - 02/12/2012",
        "10 Aug 2012 - Some Other 2 - Ep. 2",
        "Ep. 93 - Some program name",
        "Someother random name - Epis. 1 e 2",
        "The Last one with something inside parenthesis (V.O.)"
    };
    // Applied in order; some patterns are repeated on purpose so they run again after others have been stripped
    static string[] regexes =
    {
        @"T\d+",
        @"\-",
        @"Ep(i(s(o(d(e)?)?)?)?)?\s*\.?\s*\d+(\s*e\s*\d+)*",
        @"\d{2}\/\d{2}\/\d{2,4}",
        @"\d{2}\s*[A-Z]{3}\s*\d{4}",
        @"T\d+",
        @"\-",
        @"\!",
        @"\(.+\)",
    };
    public static void Main()
    {
        foreach (var str in strs)
        {
            string cleaned = str.Trim();
            foreach (var cleaner in regexes)
            {
                // Strip the sequence from the start, then from the end, of the string
                cleaned = Regex.Replace(cleaned, "^" + cleaner, string.Empty, RegexOptions.IgnoreCase).Trim();
                cleaned = Regex.Replace(cleaned, cleaner + "$", string.Empty, RegexOptions.IgnoreCase).Trim();
            }
            Console.WriteLine(cleaned);
        }
        Console.ReadKey();
    }
}
If it's only about checking for patterns, and not actually extracting the title name, let me have a go:
With @"Ep(is)?\.?\s*\d+" you can check for strings such as "Ep1", "Ep01", "Ep.999", "Ep3", "Epis.0", "Ep 11" and similar (it also detects multiple whitespace characters between Ep and the numeral).
You may want to use the RegexOptions.IgnoreCase in case you want to match "ep1" as well as "Ep1" or "EP1"
If you are certain, that no name will include a "-" and that this character separates name from episode-info, you can try to split the string like this:
string[] splitString = inputString.Split(new char[] {'-'});
for (int i = 0; i < splitString.Length; i++)
{
    splitString[i] = splitString[i].Trim(); // removes all leading and trailing whitespace
}
You'll have the name in either splitString[0] or splitString[1] and the episode-info in the other.
To search for dates, you can use this: @"\d{1,4}(\\|/|\.|,)\d{1,2}(\\|/|\.|,)\d{1,4}", which detects dates with the year at the front or the back, written with 1 to 4 digits (except for the middle value, which can be 1 to 2 digits long), and separated by a backslash, a slash, a dot or a comma. The dot is escaped as \. so that it matches a literal dot rather than any character.
Like I mentioned before: this will not allow your program to extract the actual title, only to find out if such strings exist (those strings may still be part of the title itself)
Edit:
A way to get rid of multiple whitespace characters is to use inputString = Regex.Replace(inputString, @"\s+", " "), which replaces each run of whitespace with a single space. Maybe you have underscores instead of whitespace, such as "This_is_a_name"? In that case you might want to use inputString = Regex.Replace(inputString, "_+", " ") before collapsing the whitespace.