Using Regex.Split to remove anything non numeric and splitting on -

I'm not sure why, but for some reason the Regex.Split method is going over my head. I've looked through tutorials for what I need and can't seem to find anything.
I am simply reading an Excel document and want to format a string such as $145,000-$179,999 to give me two strings: 145000 and 179999. At the same time I'd like to prune a string such as $180,000-Limit to simply 180000.
var loanLimits = Regex.Matches(Result.Rows[row + 2 + i][column].ToString(), @"\d+");
The above code seems to chop $145,000-$179,999 up into 4 parts: 145, 000, 179, 999. Any ideas on how to achieve what I'm asking?

Regular expressions match exactly, character by character (there's no knowledge of the concept of a "number" or a "word" in regular expressions - you have to define that yourself in your expression). The expression you are using, \d+, uses the character class \d, which means any digit 0-9 (and + means match one or more). In the string $145,000, notice that the part you are looking for is not just composed of digits; it also includes a comma. So the regular expression finds every continuous group of characters that matches it, which gives the four groups of digits.
There are a couple of ways to approach the problem.
Include , in your regular expression, so (\d|,)+, which means match as many characters in a row as possible that are either a digit or a comma. There will be two matches: 145,000 and 179,999, from which you can further remove the commas with myStr.Replace(",", "").
Do as you say in the title, and remove all non-numeric characters. So you could use Regex.Replace with the expression [^\d-]+ - which means match anything that is not a digit or a hyphen - and then replace those with "". Then the result would be 145000-179999, which you can split with a simple non-regular-expression split, myStr.Split('-'), to get your two parts.
Note that for your second example ($180,000-Limit), you'll need an extra check to count the number of non-empty results returned from Matches in the first approach, and from Split in the second approach, to determine whether there were two numbers in the range, or only a single number.
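As a rough sketch of the first approach, including the extra check for a single number: the helper name ParseLimits is made up for illustration, and [\d,]+ is used as an equivalent, slightly shorter form of (\d|,)+.

```csharp
// Top-level program (C# 9 or later). ParseLimits is a hypothetical helper name.
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Returns one entry per number found, with the thousands separators removed.
static string[] ParseLimits(string cell) =>
    Regex.Matches(cell, @"[\d,]+")            // runs of digits and commas
         .Cast<Match>()
         .Select(m => m.Value.Replace(",", ""))
         .Where(v => v.Length > 0)            // drop runs that were only commas
         .ToArray();

string[] range = ParseLimits("$145,000-$179,999");
string[] open = ParseLimits("$180,000-Limit");

Console.WriteLine(string.Join(" | ", range));   // 145000 | 179999
Console.WriteLine(open.Length == 2 ? "range" : "single limit: " + open[0]);
```

Checking the length of the returned array is what distinguishes a full range from an open-ended limit.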

You can try to treat each part separately by splitting the string on - and extracting only the digits from each part:
ArrayList mystrings = new ArrayList();
List<string> myList = Result.Rows[row + 2 + i][column].ToString().Split('-').ToList();
foreach(var item in myList)
{
    // Remove every character that is not a digit
    string result = Regex.Replace(item, @"[^\d]", "");
    mystrings.Add(result);
}

An alternative to using Regex is to use the built-in string and char methods in the .NET Framework. Assuming the input string will always have a single hyphen:
string input = "$145,000-$179,999";
var split = input.Split( '-' )
.Select( x => string.Join( "", x.Where( char.IsLetterOrDigit ) ) )
.ToList();
string first = split.First(); //145000
string second = split.Last(); //179999
First, you split the string using the standard Split method.
Then you create a new string by selectively taking only letters or digits from each item in the collection: x.Where...
Then you join the characters back into a string using the standard Join method.
Finally, take the first and last item in the collection for your 2 strings.

Related

Regex Split based on multiple delimiters in C#

I have a string of type "KeyOperatorValue1,Value2,Value2....", e.g. "version>=5", "lang=en,fr,es" etc. Currently, the possible values for the operator field are "=", "!=", ">", ">=", "<", "<=", but I don't want it to be limited to them only. Now the problem is: given such a string, how can I split it into a triplet?
Since the operators' string representations are not mutually exclusive ("=" is a substring of ">="), I can't use public string[] Split(string[] separator, StringSplitOptions options), and Regex.Split doesn't have a variant which takes multiple regexes as parameters.

Since you have not mentioned the format of your input I have made certain assumptions..
I have assumed that
the key always consists of alphanumeric characters
the values are always alphanumeric characters, optionally separated by ,
the key and the values are separated by non-word characters
(?<key>\w+)(?<operand>[^\w,]+)(?<value>[\w,]+)
So this matches as the operand any run of characters that are neither , nor one of [a-zA-Z\d_]
You can use this code
var lst=Regex.Matches(input,regex)
.Cast<Match>()
.Select(x=>new{
key=x.Groups["key"].Value,
operand=x.Groups["operand"].Value,
value=x.Groups["value"].Value
});
You can now iterate over lst
foreach(var l in lst)
{
    Console.WriteLine(l.key);
    Console.WriteLine(l.operand);
    Console.WriteLine(l.value);
}

Regex has "or" operator (separators will be included in the result though):
Regex.Split(sourceString, @"(>=)|(<=)|(!=)|(=)|(>)|(<)");

You don't have to use regular expressions to accomplish that. Simply store the operators in an array. Keep the array sorted by the length of the operators, longest first, so that ">=" is tried before "=". Iterate over the operators and get the position of the operator using IndexOf(). Now you can use Substring() to extract the key and the values from your input string.
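A minimal sketch of that idea might look like the following. SplitTriplet is a hypothetical helper name, and the sketch assumes that neither the key nor the values contain operator characters themselves.

```csharp
// Top-level program (C# 9 or later). SplitTriplet is a hypothetical helper name.
using System;
using System.Linq;

// Longest operators first, so ">=" is tried before "=" and ">".
string[] operators = new[] { "=", "!=", ">", ">=", "<", "<=" }
    .OrderByDescending(op => op.Length)
    .ToArray();

// Returns { key, operator, values }, or an empty array when no operator is found.
string[] SplitTriplet(string input)
{
    foreach (string op in operators)
    {
        int pos = input.IndexOf(op, StringComparison.Ordinal);
        if (pos > 0) // the key must not be empty
            return new[] { input.Substring(0, pos), op, input.Substring(pos + op.Length) };
    }
    return Array.Empty<string>();
}

Console.WriteLine(string.Join(" | ", SplitTriplet("version>=5")));     // version | >= | 5
Console.WriteLine(string.Join(" | ", SplitTriplet("lang=en,fr,es")));  // lang | = | en,fr,es
```

New operators can be supported by adding them to the array; the ordering step keeps longer operators ahead of their shorter substrings.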

You can just use branching to provide multiple alternatives. There are multiple possibilities to achieve this, one example would be this:
(\w+)([!<>]?=|[<>])(.*)
As you can see this expression contains three separate capture groups:
(\w+): This will match "word" characters (alphanumerical and underscores), as long as the sequence is at least one character long (+).
([!<>]?=|[<>]): This expression matches the operators given in your example. The first half ([!<>]?=) will match any of the characters inside [] (or skip it (?)) followed by =. The alternative simply matches < or >.
(.*): This will match any character (or nothing), whatever follows till the end of the string/line.
So when you match the expression, you'll get a total of 4 (sub) matches:
0: The whole match.
1: The name of the key.
2: The operator used.
3: The actual value given.
Edit:
If you'd like to match other operators as well, you'd have to add them as additional branches in the second matching group:
(\w+)([!<>]?=|[<>]|HERE)(.*)
Just keep in mind that there's in general no 100% perfect way to match any operator without defining the exact characters that should be considered valid operands (or components of an operand).
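For illustration, the expression could be applied in C# roughly like this. ParseTriplet is a made-up helper name, and the ^ and $ anchors are an addition that is not part of the expression above.

```csharp
// Top-level program (C# 9 or later). ParseTriplet is a hypothetical helper name.
using System;
using System.Text.RegularExpressions;

// Returns { key, operator, value }, or an empty array when the input does not match.
static string[] ParseTriplet(string input)
{
    Match m = Regex.Match(input, @"^(\w+)([!<>]?=|[<>])(.*)$");
    return m.Success
        ? new[] { m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value }
        : Array.Empty<string>();
}

Console.WriteLine(string.Join(" | ", ParseTriplet("version>=5")));     // version | >= | 5
Console.WriteLine(string.Join(" | ", ParseTriplet("lang=en,fr,es")));  // lang | = | en,fr,es
```

Groups[0] would hold the whole match, which is why the key, operator and value are read from groups 1 to 3.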

Regular Expression for splitting string by number of characters

I have a 2D barcode that I need to parse into two different items. I want my first expression to read the first 10 characters (numbers and letters) only. I want the second expression to ignore the first 10 characters and then read the remaining characters (numbers, letters, _ ). The total number of remaining characters is not consistent.
Here is a sample of what the barcode reads. 20P0000002_0_DP-3_TR_DEBIT
Any suggestions?

You don't need regular expressions, String.Substring will do:
var first = barcode.Substring(0, 10);
var second = barcode.Substring(10);
You can then check if the first part is just letters and numbers with the concise but not strictly accurate (char.IsLetterOrDigit also accepts non-ASCII letters and digits)
var isValid = first.All(char.IsLetterOrDigit);
or with the more prosaic
var acceptable = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
var isValid = first.All(c => acceptable.IndexOf(char.ToUpper(c)) != -1);

For your first expression you would use this.
^([\dA-Za-z]{10})
^ = match beginning of string
( = begin capture group
[ = begin set of characters to match
\d = match all digits (0-9)
A-Za-z = match all uppercase and lowercase letters
] = end character set
{10} = match exactly 10 of the previous character set
) = end capture group
For your second, this one
^.{10}(.*)$
^.{10} = match the first ten characters of the string (but don't capture them)
(.*)$ = capture all remaining characters until the end of the string
EDIT:
As pointed out in the comments, you could easily combine these two expressions into one like so.
^([\dA-Za-z]{10})(.*)$
This will yield two capture groups with only one match operation.
It's worth noting that using a RegEx might be a good solution since the match will tell you whether or not the initial ten characters are only alphanumeric characters. If you're only seeking to capture the first ten characters regardless of what they are, then a RegEx is overkill. But if you want validation, a RegEx is a nice way to do that. Performance could be argued though, but you're already using .NET which carries some performance impact anyway.
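A short sketch of the combined expression in use, with the sample barcode from the question. SplitBarcode is a hypothetical helper name, and returning an empty array for invalid input is just one possible design.

```csharp
// Top-level program (C# 9 or later). SplitBarcode is a hypothetical helper name.
using System;
using System.Text.RegularExpressions;

// Returns { first ten characters, remainder }, or an empty array when the
// first ten characters are not all letters or digits.
static string[] SplitBarcode(string barcode)
{
    Match m = Regex.Match(barcode, @"^([\dA-Za-z]{10})(.*)$");
    return m.Success
        ? new[] { m.Groups[1].Value, m.Groups[2].Value }
        : Array.Empty<string>();
}

string[] parts = SplitBarcode("20P0000002_0_DP-3_TR_DEBIT");
Console.WriteLine(parts[0]);  // 20P0000002
Console.WriteLine(parts[1]);  // _0_DP-3_TR_DEBIT
```

Because the match fails when the prefix contains anything other than letters or digits, the same call doubles as the validation step.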

Remove substring from a list of strings

I have a list of strings that contain banned words. What's an efficient way of checking if a string contains any of the banned words and removing it from the string? At the moment, I have this:
cleaned = String.Join(" ", str.Split().Where(b => !bannedWords.Contains(b,
StringComparer.OrdinalIgnoreCase)).ToArray());
This works fine for single banned words, but not for phrases (e.g. more than one word). Any instance of more than one word should also be removed. An alternative I thought of trying is to use the List's Contains method, but that only returns a bool and not an index of the matching word. If I could get an index of the matching word, I could just use String.Replace(bannedWords[i],"");

A simple String.Replace will not work as it will remove word parts. If "sex" is a banned word and you have the word "sextet", which is not banned, you should keep it as is.
Using Regex you can find whole words and phrases in a text with
string text = "A sextet is a musical composition for six instruments or voices.";
string word = "sex";
var matches = Regex.Matches(text, @"(?<=\b)" + word + @"(?=\b)");
The matches collection will be empty in this case.
You can use the Regex.Replace method
foreach (string word in bannedWords) {
    text = Regex.Replace(text, @"(?<=\b)" + word + @"(?=\b)", "");
}
Note: I used the following Regex pattern
(?<=prefix)find(?=suffix)
where 'prefix' and 'suffix' are both \b, which denotes word beginnings and ends.
If your banned words or phrases can contain special characters, it would be safer to escape them with Regex.Escape(word).
Using @zmbq's idea you could create a Regex pattern once with
string pattern =
    @"(?<=\b)(" +
    String.Join(
        "|",
        bannedWords
            .Select(w => Regex.Escape(w))
            .ToArray()) +
    @")(?=\b)";
var regex = new Regex(pattern); // Parsed once here; pass RegexOptions.Compiled to compile it to IL
and then apply it repeatedly to different texts with
string result = regex.Replace(text, "");

It doesn't work because you have conflicting definitions.
When you want to look for sub-sentences like more than one word you cannot split on whitespace anymore. You'll have to fall back on String.IndexOf()

If it's performance you're after, I assume you're not worried about one-time setup time, but rather about continuous performance. So I'd build one huge regular expression containing all the banned expressions and make sure it's compiled - that's as a setup.
Then I'd try to match it against the text, and replace every match with a blank or whatever you want to replace it with.
The reason for this, is that a big regular expression should compile into something comparable to the finite state automaton you would create by hand to handle this problem, so it should run quite nicely.
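A sketch of that setup, under a few assumptions: the banned list is sample data, matching is limited to whole words with \b, case is ignored, and longer entries are placed first so that a phrase is preferred over a shorter word it starts with.

```csharp
// Top-level program (C# 9 or later). The banned list below is sample data.
using System;
using System.Linq;
using System.Text.RegularExpressions;

string[] bannedWords = { "sex", "bad phrase", "bad" };

// Longer entries first, so "bad phrase" wins over "bad" in the alternation.
string alternation = string.Join("|",
    bannedWords.OrderByDescending(w => w.Length).Select(w => Regex.Escape(w)));

// One-time setup: a single compiled expression covering every banned entry.
var banned = new Regex(@"\b(?:" + alternation + @")\b",
    RegexOptions.Compiled | RegexOptions.IgnoreCase);

// Applied repeatedly to different texts.
string Clean(string text) => banned.Replace(text, "");

Console.WriteLine(Clean("A sextet is not a bad phrase."));  // A sextet is not a .
```

Removing a word leaves the surrounding spaces in place, so a follow-up whitespace cleanup may be wanted.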

Why don't you iterate through the list of banned words and look up each of them in the string by using the method string.IndexOf.
For example, you can remove the banned words and phrases with the following piece of code:
myForbWords.ForEach(delegate(string item) {
int occ = str.IndexOf(item);
if(occ > -1) str = str.Remove(occ, item.Length);
});
Type of myForbWords is List<string>.

Get substring from string in C# using Regular Expression

I have a string like:
Brief Exercise 1-1 Types of Businesses Brief Exercise 1-2 Forms of Organization Brief Exercise 1-3 Business Activities.
I want to break above string using regular expression so that it can be like:
Types of Businesses
Forms of Organization
Business Activities.
Please don't say that I can break it using 1-1, 1-2 and 1-3 because it will bring the word "Brief Exercise" in between the sentences. Later on I can have Exercise 1-1 or Problem 1-1 also. So I want some general Regular expression.
Any efficient regular expression for this scenario ?

var regex=new Regex(@"Brief (?:Exercise|Problem) \d+-\d+\s");
var result=string.Join("\n",regex.Split(x).Where(a=>!string.IsNullOrEmpty(a)));
The regex will match "Brief " followed by either "Exercise" or "Problem" (the ?: makes the group non capturing), followed by a space, then 1 or more digits then a "-", then one or more digits then a space.
The second statement uses the Split function to split the string into an array, then Where to skip all the empty entries (otherwise the result would include the empty string at the beginning; you could use Skip(1) instead of Where(a=>!string.IsNullOrEmpty(a))), and finally uses string.Join to combine the array back into a string with \n as the separator.
You could use regex.Replace to convert directly to \n, but you will end up with a \n at the beginning that you would have to strip.
--EDIT---
If the first number is always 1 and the second number is 1-50ish, you could use the following regex, which supports 0-59:
var regex=new Regex(@"Brief (?:Exercise|Problem) 1-[1-5]?\d\s");

This regular expression will match on "Brief Exercise 1-" followed by a digit and an optional second digit:
@"Brief Exercise 1-\d\d?"
Update:
Since you might have "Problem" as well, an alternation between Exercise and Problem is also needed (using non capturing parenthesis):
@"Brief (?:Exercise|Problem) 1-\d\d?"

Why don't you do it the easy way? I mean, if the regular part is "Brief Exercise #-#" Replace it by some split character and then split the resulting string to obtain what you want.
If you do it otherwise you will always have to take care of special cases.
string pattern = @"Brief Exercise \d+-\d+";
Regex reg = new Regex(pattern);
string replaced = reg.Replace(yourstring, "|");
string[] results = replaced.Split('|');

Regular expression for numbers in string

The input string "134.45sdfsf" passed to the following statement
System.Text.RegularExpressions.Regex.Match(input, pattern).Success;
returns true for following patterns.
pattern = "[0-9]+"
pattern = "\\d+"
Q1) I am like, what the hell! I am specifying only digits, and not special characters or alphabets. So what is wrong with my pattern, if I were to get false returned value with the above code statement.
Q2) Once I get the right pattern to match just the digits, how do I extract all the numbers in a string?
Let's say for now I just want to get the integers in a string in the format "int.int^int" (for example, "11111.222^3333"; in this case, I want to extract the strings "11111", "222" and "3333").
Any idea?
Thanks

You are specifying that it contains at least one digit anywhere, not that all characters are digits. You are looking for the expression ^\d+$. The ^ and $ denote the start and end of the string, respectively.
Use Regex.Split to split on any run of non-digit characters. For example:
string input = "123&$456";
var isAllDigit = Regex.IsMatch(input, @"^\d+$");
var numbers = Regex.Split(input, @"[^\d]+");

It returns true because it has found a match somewhere in the string.
If you want the whole string to be checked, use:
^[0-9]+$
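To illustrate the difference, and one possible way to pull out the numbers afterwards, here is a small sketch. ExtractNumbers is a hypothetical helper name; it uses Regex.Matches rather than Regex.Split, which avoids empty entries when the input starts or ends with non-digits.

```csharp
// Top-level program (C# 9 or later). ExtractNumbers is a hypothetical helper name.
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Unanchored: succeeds because a run of digits exists somewhere in the input.
Console.WriteLine(Regex.IsMatch("134.45sdfsf", @"[0-9]+"));    // True
// Anchored: fails because the input is not digits from start to end.
Console.WriteLine(Regex.IsMatch("134.45sdfsf", @"^[0-9]+$"));  // False

// Collects every run of digits in the input, in order of appearance.
static string[] ExtractNumbers(string input) =>
    Regex.Matches(input, @"[0-9]+").Cast<Match>().Select(m => m.Value).ToArray();

Console.WriteLine(string.Join(", ", ExtractNumbers("11111.222^3333")));  // 11111, 222, 3333
```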

Q1) Both patterns are correct.
Q2) Assuming you are looking for a number pattern "5 digits-dot-3 digits-^-4 digits" - here is what your looking for:
var regex = new Regex(@"(?<first>[0-9]{5})\.(?<second>[0-9]{3})\^(?<third>[0-9]{4})");
var match = regex.Match("11111.222^3333");
Debug.Print(match.Groups["first"].ToString());
Debug.Print(match.Groups["second"].ToString());
Debug.Print(match.Groups["third"].ToString());
I prefer named capture groups - they give a clearer way to access the values than numbered groups.
