How can I split part of a string that is inconsistent? - c#

I have the following string:
01-21-27-0000-00-048 and it is easy to split it apart because each section is separated by a -, but sometimes this string is represented as 01-21-27-0000-00048, so splitting it is not as easy because the last 2 parts are combined. How can I handle this? Also, what about the case where it might be something like 01-21-27-0000-00.048
In case anyone is curious, this is a parcel number and it varies from county to county and a county can have 1 format or they can have 100 formats.

This is a very good case for using regular expressions. Your string matches the following regexp:
(\d{2})-(\d{2})-(\d{2})-(\d{4})-(\d{2})[.-]?(\d{3})
Match the input against this expression, and harvest the six groups of digits from the match:
var str = new[] {
    "01-21-27-0000-00048", "01-21-27-0000-00.048", "01-21-27-0000-00-048"
};
foreach (var s in str) {
    var m = Regex.Match(s, @"(\d{2})-(\d{2})-(\d{2})-(\d{4})-(\d{2})[.-]?(\d{3})");
    for (var i = 1 /* one, not zero */ ; i != m.Groups.Count ; i++) {
        Console.Write("{0} ", m.Groups[i]);
    }
    Console.WriteLine();
}
If you would like to allow for other characters, say, letters in the segments that are separated by dashes, you could use \w instead of \d to denote a letter, a digit, or an underscore. If you would like to allow an unspecified number of such characters within a known range, say, two to four, you can use {2,4} in the regexp instead of the more specific {2}, which means "exactly two". For example,
(\w{2,3})-(\w{2})-(\w{2})-(\d{4})-(\d{2})[.-]?(\d{3})
lets the first segment contain two to three digits or letters, and also allows letters in segments two and three.

Normalize the string first.
I.e. if you know that the last part is always three characters, then insert a - as the fourth-to-last character, then split the resultant string. Along the same line, convert the dot '.' to a dash '-' and split that string.
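A minimal sketch of that normalization, assuming the last segment is always three characters and that '-' and '.' are the only separators. The class and method names (ParcelNormalizer, SplitParcel) are made up for illustration:

```csharp
using System;

public static class ParcelNormalizer
{
    // Hypothetical helper: assumes the final segment is always three characters
    // and that '-' and '.' are the only separators that can appear.
    public static string[] SplitParcel(string parcel)
    {
        // Treat '.' as just another dash.
        string normalized = parcel.Replace('.', '-');

        // If the last three characters are not already preceded by a dash, insert one.
        int cut = normalized.Length - 3;
        if (cut > 0 && normalized[cut - 1] != '-')
        {
            normalized = normalized.Insert(cut, "-");
        }

        return normalized.Split('-');
    }
}
```

With the three inputs from the question, each call returns the same six segments.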

Replace every character that is not a digit with an empty string ("").
Every variant of the string then takes the form
012127000000048
which you can divide into parts of (2, 2, 2, 4, 2, 3) characters.
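The digits-only approach could look roughly like this. It is a sketch that assumes every parcel number has exactly 15 digits in the 2-2-2-4-2-3 layout from the question; ParcelSlicer and Slice are illustrative names:

```csharp
using System;
using System.Text.RegularExpressions;

public static class ParcelSlicer
{
    // Segment widths taken from the 2-2-2-4-2-3 layout in the question (an assumption
    // that will not hold for counties using other formats).
    private static readonly int[] Widths = { 2, 2, 2, 4, 2, 3 };

    public static string[] Slice(string parcel)
    {
        // Drop every non-digit character (dashes, dots, spaces, ...).
        string digits = Regex.Replace(parcel, @"\D", "");
        if (digits.Length != 15)
        {
            throw new FormatException("Expected 15 digits, got " + digits.Length);
        }

        // Cut the digit string into consecutive fixed-width pieces.
        var parts = new string[Widths.Length];
        int pos = 0;
        for (int i = 0; i < Widths.Length; i++)
        {
            parts[i] = digits.Substring(pos, Widths[i]);
            pos += Widths[i];
        }
        return parts;
    }
}
```

The length check is there because a fixed-width cut silently produces wrong segments when the input does not follow the expected layout.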

Related

REGEX Matching string nonconsecutively

I'm trying to understand how to match a specific string that's held within an array (This string will always be 3 characters long, ex: 123, 568, 458 etc) and I would match that string to a longer string of characters that could be in any order (9841273 for example). Is it possible to check that at least 2 of the 3 characters in the string match (in this example) strMoves? Please see my code below for clarification.
private readonly string[] strSolutions = new string[8] { "123", "159", "147", "258", "357", "369", "456", "789" };
private static string strMoves = "1823742";
foreach (string strResult in strSolutions)
{
Regex rgxMain = new Regex("[" + strMoves + "]{2}");
if (rgxMain.IsMatch(strResult))
{
MessageBox.Show(strResult);
}
}
The portion where I have designated "{2}" in Regex is where I expected the result to check for at least 2 matching characters, but my logic is definitely flawed. It will return true IF the two characters are in consecutive order as compared to the string in strResult. If it's not in the correct order it will return false. I'm going to continue to research on this but if anyone has ideas on where to look in Microsoft's documentation, that would be greatly appreciated!
Correct order where it would return true: "144257" when matched to "123"
incorrect order: "35718" when matched to "123"
The 3 is before the 1, so it won't match.
You can use the following solution if you need to find at least two different not necessarily consecutive chars from a specified set in a longer string:
new Regex($@"([{strMoves}]).*(?!\1)[{strMoves}]", RegexOptions.Singleline)
It will look like
([1823742]).*(?!\1)[1823742]
Pattern details:
([1823742]) - Capturing group 1: one of the chars in the character class
.* - any zero or more chars as many as possible (due to RegexOptions.Singleline, . matches any char including newline chars)
(?!\1) - a negative lookahead that fails the match if the next char is a starting point of the value stored in the Group 1 memory buffer (since it is a single char here, the next char should not equal the text in Group 1, one of the specified digits)
[1823742] - one of the chars in the character class.
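Applied to the arrays from the question, the pattern could be wrapped like this. It is a sketch that assumes the set of allowed characters contains only digits, so nothing needs escaping inside the character class; MoveMatcher and HasTwoDistinct are made-up names:

```csharp
using System.Text.RegularExpressions;

public static class MoveMatcher
{
    // Returns true when 'candidate' contains at least two different characters
    // from 'allowed', in any order and not necessarily adjacent.
    // Assumption: 'allowed' holds only digits, so it is safe inside [...].
    public static bool HasTwoDistinct(string candidate, string allowed)
    {
        var regex = new Regex($@"([{allowed}]).*(?!\1)[{allowed}]", RegexOptions.Singleline);
        return regex.IsMatch(candidate);
    }
}
```

With strMoves = "1823742", the solutions "123", "147", "258", "357" and "789" pass this check, while "159", "369" and "456" each contain only one of the allowed digits and do not.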

Extract phone numbers and exclude extraneous characters

I'm trying to create a regex which will extract a complete phone number from a string (which is the only thing in the string) but leaving out any cruft like decorative brackets, etc.
The pattern I have mostly appears to work, but returns a list of matches - whereas I want it to return the phone number with the characters removed. Unfortunately, it completely fails if I add the start and end of line matchers...
^(?!\(\d+\)\s*){1}(?:[\+\d\s]*)$
Without the ^ and $ this matches the following numbers:
12345-678-901 returns three groups: 12345 678 901
+44-123-4567-8901 returns four groups: +44 123 4567 8901
(+48) 123 456 7890 returns four groups: +48 123 456 7890
How can I get the groups to be returned as a single, joined up whole?
Other than that, the only change I would like to include is to return nothing if there are any non-numeric, non-bracket, non-+ characters anywhere. So, this should fail:
(+48) 123 burger 7890
I'd keep it simple, makes it more readable and maintainable:
public string CleanPhoneNumber(string messynumber){
if(Regex.IsMatch(messynumber, "[a-z]"))
return "";
else
return Regex.Replace(messynumber, "[^0-9+]", "");
}
If any alphabetic characters are present (extend this range if you wish), return blank; otherwise replace every char that is not 0-9 or + with nothing. This produces output like 0123456789 and +481234567 with all the brackets, spaces, hyphens etc. removed too. If you want to keep those in the output, add them to the Regex.
Side note: it's not immediately clear to me what you think is "cruft" that should be stripped (non a-z?) and what you think is "cruft" that should cause blank (a-z?). I struggled with this because you said (paraphrasing) "non-digit, non-bracket, non-plus should cause blank", but earlier your examples permitted numbers that had hyphens and spaces. Taken strictly, the spec would make hyphens and spaces "cruft that causes the whole thing to return blank" too.
I've assumed that it's lowercase chars, from the "burger" example, but as noted you can extend the range in the IF part should you need to include other chars that return blank.
If you have a lot of them to do, maybe precompile a regex as a class-level variable and use it in the method:
private Regex _strip = new Regex("[^0-9+]", RegexOptions.Compiled);
public string CleanPhoneNumber(string messynumber){
    if(Regex.IsMatch(messynumber, "[a-z]"))
        return "";
    else
        return _strip.Replace(messynumber, "");
}
...
for(int x = 0; x < millionStrArray.Length; x++)
    millionStrArray[x] = CleanPhoneNumber(millionStrArray[x]);
I don't think you'll gain much from compiling the IsMatch one, but you could try it in a similar pattern.
Other options exist if you're avoiding regex: you could even do it using LINQ, or by looping over char arrays, StringBuilders etc. Regex is probably the easiest in terms of short, maintainable code.
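For comparison, a LINQ version might look like the sketch below. Note one deliberate difference, which is my assumption rather than part of the answer above: it rejects any letter (via char.IsLetter), not only lowercase a-z. PhoneCleaner and Clean are illustrative names:

```csharp
using System.Linq;

public static class PhoneCleaner
{
    // Returns "" if the input contains any letter; otherwise keeps only
    // ASCII digits and '+', dropping brackets, spaces, hyphens and so on.
    public static string Clean(string messyNumber)
    {
        if (messyNumber.Any(c => char.IsLetter(c)))
        {
            return "";
        }

        return new string(messyNumber
            .Where(c => (c >= '0' && c <= '9') || c == '+')
            .ToArray());
    }
}
```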
The strategy here is to use a look ahead and kick out (fail) a match if word characters are found.
Then when there are no characters, it then captures the + and all numbers into a match group named "Phone". We then extract that from the match's "Phone" capture group and combine as such:
string pattern = @"
^
(?=[\W\d+\s]+\Z) # Only allows Non Words, decimals and spaces; stop match if letters found
(?<Phone>\+?) # If a plus found at the beginning; allow it
( # Group begin
(?:\W*) # Match but don't *capture* any non numbers
(?<Phone>[\d]+) # Put the numbers in.
)+ # 1 to many numbers.
";
var number = "+44-123-33-8901";
var phoneNumber =
string.Join(string.Empty,
Regex.Match(number,
pattern,
RegexOptions.IgnorePatternWhitespace // Allows us to comment the pattern
).Groups["Phone"]
.Captures
.OfType<Capture>()
.Select(cp => cp.Value));
// phoneNumber is `+44123338901`
If one looks at the match structure, the data it houses is this:
Match #0
[0]: +44-123-33-8901
["1"] → [1]: -8901
→1 Captures: 44, -123, -33, -8901
["Phone"] → [2]: 8901
→2 Captures: +, 44, 123, 33, 8901
As you can see match[0] contains the whole match, but we only need the captures under the "Phone" group. With those captures { +, 44, 123, 33, 8901 } we now can bring them all back together by the string.Join.

Get sub-strings from a string that are enclosed using some specified character

Suppose I have a string
Likes (20)
I want to fetch the sub-string enclosed in round brackets (in above case its 20) from this string. This sub-string can change dynamically at runtime. It might be any other number from 0 to infinity. To achieve this my idea is to use a for loop that traverses the whole string and then when a ( is present, it starts adding the characters to another character array and when ) is encountered, it stops adding the characters and returns the array. But I think this might have poor performance. I know very little about regular expressions, so is there a regular expression solution available or any function that can do that in an efficient way?
If you don't fancy using regex you could use Split:
string foo = "Likes (20)";
string[] arr = foo.Split(new char[]{ '(', ')' }, StringSplitOptions.None);
string count = arr[1];
Count = 20
This will work fine regardless of the number in the brackets ()
e.g:
Likes (242535345)
Will give:
242535345
Works also with pure string methods:
string result = "Likes (20)";
int index = result.IndexOf('(');
if (index >= 0)
{
result = result.Substring(index + 1); // take part behind (
index = result.IndexOf(')');
if (index >= 0)
result = result.Remove(index); // remove part from )
}
For a strict matching, you can do:
Regex reg = new Regex(@"^Likes \((\d+)\)$");
Match m = reg.Match(yourstring);
This way you'll have all you need in m.Groups[1].Value.
As suggested by I4V, assuming you have only that sequence of digits in the whole string, as in your example, you can use the simpler version:
var res = Regex.Match(str, @"\d+");
and in this case, you can get the value you are looking for with res.Value.
EDIT
In case the value enclosed in brackets is not just numbers, you can just change the \d with something like [\w\d\s] if you want to allow in there alphabetic characters, digits and spaces.
Even with Linq:
var s = "Likes (20)";
var s1 = new string(s.SkipWhile(x => x != '(').Skip(1).TakeWhile(x => x != ')').ToArray());
const string likes = "Likes (20)";
int open = likes.IndexOf('(');
int close = likes.IndexOf(')');
int likesCount = int.Parse(likes.Substring(open + 1, close - open - 1));
Matching when the part in parentheses is supposed to be a number:
string inputstring = "Likes (20)";
Regex reg = new Regex(@"\((\d+)\)");
string num = reg.Match(inputstring).Groups[1].Value;
Explanation:
By definition, a regexp matches a substring, so unless you indicate otherwise, the string you are looking for can occur at any place in your string.
\d stands for digits. It will match any single digit.
We want it to potentially be repeated several times, and we want at least one. The + sign is regexp for previous symbol or group repeated 1 or more times.
So \d+ will match one or more digits. It will match 20.
To ensure that we get the number that is in parentheses, we say that it should be between ( and ). These are special characters in regexp, so we need to escape them.
\(\d+\) would match (20), and we are almost there.
Since we want the part inside the parentheses, and not the parentheses themselves, we tell regexp that the digits part is a single group.
We do that by using parentheses in our regexp. \((\d+)\) will still match (20), but now it will note that 20 is a subgroup of this match and we can fetch it by Match.Groups[].
For any string in parentheses, things get a little bit harder.
Regex reg = new Regex(@"\((.+)\)");
would work for many strings (the dot matches any character). But if the input is something like "This is an example(parantesis1)(parantesis2)", you would match (parantesis1)(parantesis2) with parantesis1)(parantesis2 as the captured subgroup. This is unlikely to be what you are after.
The solution can be to do the matching for "any character except a closing parenthesis":
Regex reg = new Regex(@"\(([^\)]+)\)");
This will find (parantesis1) as the first match, with parantesis1 as .Groups[1].
It will still fail for nested parentheses, but since regular expressions are not the correct tool for nested parentheses, I feel that this case is a bit out of scope.
If you know that the string always starts with "Likes " before the group then Saves solution is better.
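To collect every parenthesized group rather than just the first, the "anything except a closing parenthesis" pattern can be combined with Regex.Matches. This is a sketch with made-up names (ParenExtractor, Extract), and like the pattern above it does not handle nesting:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ParenExtractor
{
    // Returns the text inside each (...) pair, in order of appearance.
    // Nested parentheses are not supported.
    public static List<string> Extract(string input)
    {
        return Regex.Matches(input, @"\(([^)]+)\)")
            .Cast<Match>()
            .Select(m => m.Groups[1].Value)
            .ToList();
    }
}
```

For "Likes (20)" this yields a single item, 20; for "This is an example(parantesis1)(parantesis2)" it yields parantesis1 and parantesis2 as separate items.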

Removing words with special characters in them

I have a long string composed of a number of different words.
I want to go through all of them, and if the word contains a special character or number (except '-'), or starts with a Capital letter, I want to delete it (the whole word not just that character). For all intents and purposes 'foreign' letters can count as special characters.
The obvious solution is to run a loop through each word (after splitting it) and then a loop through each character - but I'm hoping there's a faster way of doing it? Perhaps using Regex but I've almost no experience with it.
Thanks
ADDED:
(What I want for example:)
Input: "this Is an Example of 5 words in an input like-so from example.com"
Output: {this,an,of,words,in,an,input,like-so,from}
(What I've tried so far)
List<string> response = new List<string>();
string[] splitString = text.Split(' ');
foreach (string s in splitString)
{
    bool add = true;
    foreach (char c in s.ToCharArray())
    {
        if (!(c.Equals('-') || (Char.IsLetter(c) && Char.IsLower(c))))
        {
            add = false;
            break;
        }
    }
    if (add)
    {
        response.Add(s);
    }
}
Edit 2:
For me a word should be a number of characters (a..z) separated by a space. ,/./!/... at the end shouldn't count for the 'special character' condition (which is really mostly just to remove URLs or the like).
So:
"I saw a dog. It was black!"
should result in
{saw,a,dog,was,black}
So you want to find all "words" that only contain characters a-z or -, for words that are separated by spaces?
A regex like this will find such words:
(?<!\S)[a-z-]+(?!\S)
To also allow for words that end with single punctuation, you could use:
(?<!\S)[a-z-]+(?=[,.!?:;]?(?!\S))
Example:
var re = @"(?<!\S)[a-z-]+(?=[,.!?:;]?(?!\S))";
var str = "this, Is an! Example of 5 words in an input like-so from example.com foo: bar?";
var m = Regex.Matches(str, re);
Console.WriteLine("Matched: ");
foreach (Match i in m)
Console.Write(i + " ");
Notice the punctuation in the string.
Output:
Matched:
this an of words in an input like-so from foo bar
How about this?
(?<=^|\s+)(?<word>[a-z-]+)(?=$|\s+)
Edit: Meant (?<=^|\s+)(?<word>[a-z\-]+)(?=(?:\.|,|!|\.\.\.)?(?:$|\s+))
Rules:
Word can only be preceded by start of line or some number of whitespace characters
Word can only be followed by end of line or some number of whitespace characters (Edit supports words ending with periods, commas, exclamation points, and ellipses)
Word can only contain lower case (latin) letters and dashes
The named group containing each word is "word"
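Reading the named group back in C# could look like this sketch, which uses the edited pattern and the "I saw a dog. It was black!" sample from the question. WordFinder and FindWords are illustrative names:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordFinder
{
    // Lowercase/dash words bounded by whitespace or line ends, optionally
    // followed by a period, comma, exclamation point or ellipsis.
    private static readonly Regex WordPattern = new Regex(
        @"(?<=^|\s+)(?<word>[a-z\-]+)(?=(?:\.|,|!|\.\.\.)?(?:$|\s+))");

    public static List<string> FindWords(string text)
    {
        return WordPattern.Matches(text)
            .Cast<Match>()
            .Select(m => m.Groups["word"].Value) // read the named group
            .ToList();
    }
}
```

For the sample sentence this returns saw, a, dog, was, black; "I" and "It" are skipped because they start with a capital letter.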
Have a look at Microsoft's How to: Search Strings Using Regular Expressions (C# Programming Guide) - it's about regexes in C#.
List<string> strings = new List<string>() {"asdf", "sdf-sd", "sdfsdf"};
// Iterate backwards so that removing an item does not shift the indices still to be visited.
for (int i = strings.Count - 1; i >= 0; i--)
{
    if (strings[i].Contains("-"))
    {
        strings.RemoveAt(i);
    }
}
This could be a starting point. Right now it only checks for "." as a special char. This outputs: "this an of words in an like-so from"
string pattern = @"[A-Z]\w+|\w*[0-9]+\w*|\w*[\.]+\w*";
string line = "this Is an Example of 5 words in an in3put like-so from example.com";
System.Text.RegularExpressions.Regex r = new System.Text.RegularExpressions.Regex(pattern);
line = r.Replace(line,"");
You can do this in two ways: the white-list way and the black-list way. With a white-list you define the set of characters that you consider acceptable; with a black-list it's the opposite.
Let's assume the white-list way, and that you accept only the characters a-z, A-Z and -. Additionally, you have the rule that the first character of a word cannot be an upper-case character.
With this you can do something like this:
string target = "This is a white-list example: (Foo, bar1)";
var matches = Regex.Matches(target, @"(?:\b)(?<Word>[a-z]{1}[a-zA-Z\-]*)(?:\b)");
string[] words = matches.Cast<Match>().Select(m => m.Value).ToArray();
Console.WriteLine(string.Join(", ", words));
Outputs:
// is, a, white-list, example
You can use look-aheads and look-behinds to do this. Here's a regex that matches your example:
(?<=\s|^)[a-z-]+(?=\s|$)
The explanation is: match one or more alphabetic characters (lowercase only, plus hyphen), as long as what comes before the characters is whitespace (or the start of the string), and as long as what comes after is whitespace or the end of the string.
All you need to do now is plug that into System.Text.RegularExpressions.Regex.Matches(input, regexString) to get your list of words.
Reference: http://www.mikesdotnetting.com/Article/46/CSharp-Regular-Expressions-Cheat-Sheet
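Plugged into Regex.Matches, the lookaround pattern above could be used as in this sketch, with the input taken from the question. LowercaseWords and Extract are made-up names:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class LowercaseWords
{
    // Keeps only words made of lowercase letters and hyphens that are
    // delimited by whitespace or the start/end of the string.
    public static List<string> Extract(string input)
    {
        return Regex.Matches(input, @"(?<=\s|^)[a-z-]+(?=\s|$)")
            .Cast<Match>()
            .Select(m => m.Value)
            .ToList();
    }
}
```

For the question's input this yields this, an, of, words, in, an, input, like-so, from. Words followed directly by punctuation are not matched by this particular pattern.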

How to extract decimal number from string in C#

string sentence = "X10 cats, Y20 dogs, 40 fish and 1 programmer.";
string[] digits = Regex.Split (sentence, @"\D+");
For this code I get these values in the digits array
10,20,40,1
string sentence = "X10.4 cats, Y20.5 dogs, 40 fish and 1 programmer.";
string[] digits = Regex.Split (sentence, @"\D+");
For this code I get these values in the digits array
10,4,20,5,40,1
But I would like to get like
10.4,20.5,40,1
as decimal numbers. How can I achieve this?
Small improvement to @Michael's solution:
// NOTES: about the LINQ:
// .Where() == filters the IEnumerable (which the array is)
// (c=>...) is the lambda for dealing with each element of the array
// where c is an array element.
// .Trim() == trims all blank spaces at the start and end of the string
var doubleArray = Regex.Split(sentence, @"[^0-9\.]+")
.Where(c => c != "." && c.Trim() != "");
Returns:
10.4
20.5
40
1
The original solution was returning
[empty line here]
10.4
20.5
40
1
.
The decimal/float number extraction regex can be different depending on whether and what thousand separators are used, what symbol denotes a decimal separator, whether one wants to also match an exponent, whether or not to match a positive or negative sign, whether or not to match numbers that may have leading 0 omitted, whether or not extract a number that ends with a decimal separator.
A generic regex to match the most common decimal number types is provided in Matching Floating Point Numbers with a Regular Expression:
[-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?
I only changed the capturing group to a non-capturing one (added ?: after the opening parenthesis).
If you need to make it even more generic, if the decimal separator can be either a dot or a comma, replace \. with a character class (or a bracket expression) [.,]:
[-+]?[0-9]*[.,]?[0-9]+(?:[eE][-+]?[0-9]+)?
           ^^^^
Note the expressions above match both integer and floats. To match only float/decimal numbers make sure the fractional pattern part is obligatory by removing the second ? after \. (demo):
[-+]?[0-9]*\.[0-9]+(?:[eE][-+]?[0-9]+)?
           ^
Now, 34 is no longer matched.
If you do not want to match float numbers without leading zeros (like .5) make the first digit matching pattern obligatory (by adding + quantifier, to match 1 or more occurrences of digits):
[-+]?[0-9]+\.[0-9]+(?:[eE][-+]?[0-9]+)?
          ^
Now it matches far fewer samples.
Now, what if you do not want to match <digits>.<digits> inside <digits>.<digits>.<digits>.<digits>? How to match them as whole words? Use lookarounds:
[-+]?(?<!\d\.)\b[0-9]+\.[0-9]+(?:[eE][-+]?[0-9]+)?\b(?!\.\d)
Now, what about those floats that have thousand separators, like 12 123 456.23 or 34,345,767.678? You may add (?:[,\s][0-9]+)* after the first [0-9]+ to match zero or more sequences of a comma or whitespace followed with 1+ digits:
[-+]?(?<![0-9]\.)\b[0-9]+(?:[,\s][0-9]+)*\.[0-9]+(?:[eE][-+]?[0-9]+)?\b(?!\.[0-9])
Swap the comma with \. if you need to use a comma as the decimal separator and a period as the thousand separator.
Now, how to use these patterns in C#?
var results = Regex.Matches(input, @"<PATTERN_HERE>")
.Cast<Match>()
.Select(m => m.Value)
.ToList();
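As a concrete instance, here is a sketch that plugs the first generic pattern into that template and parses each match as a decimal. It assumes a dot as the decimal separator and parses with the invariant culture; DecimalExtractor and Extract are illustrative names:

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

public static class DecimalExtractor
{
    // Matches integers and dot-separated decimals, with optional sign and exponent.
    private const string Pattern = @"[-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?";

    public static List<decimal> Extract(string input)
    {
        return Regex.Matches(input, Pattern)
            .Cast<Match>()
            // NumberStyles.Float permits a leading sign, a decimal point and an exponent;
            // the invariant culture makes '.' the decimal separator regardless of locale.
            .Select(m => decimal.Parse(m.Value, NumberStyles.Float, CultureInfo.InvariantCulture))
            .ToList();
    }
}
```

For the sentence in the question ("X10.4 cats, Y20.5 dogs, 40 fish and 1 programmer.") this produces 10.4, 20.5, 40 and 1.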
Try:
Regex.Split (sentence, @"[^0-9\.]+")
You'll need to allow for decimal places in your regular expression. Try the following:
\d+(\.\d+)?
This will match the numbers rather than everything other than the numbers, but it should be simple to iterate through the matches to build your array.
Something to keep in mind is whether you should also be looking for negative signs, commas, etc.
Check the syntax lexers for most programming languages for a regex for decimals.
Match that regex to the string, finding all matches.
If you have Linq:
stringArray.Select(s=>decimal.Parse(s));
A foreach would also work. You may need to check that each string is actually a number first: decimal.Parse throws a FormatException on invalid input, whereas decimal.TryParse returns false instead.
Credit for the following goes to @code4life. All I added is a for loop for parsing the integers/decimals before returning.
public string[] ExtractNumbersFromString(string input)
{
input = input.Replace(",", string.Empty);
var numbers = Regex.Split(input, @"[^0-9\.]+").Where(c => !String.IsNullOrEmpty(c) && c != ".").ToArray();
for (int i = 0; i < numbers.Length; i++)
numbers[i] = decimal.Parse(numbers[i]).ToString();
return numbers;
}
