Regex Split based on multiple delimiters in C# - c#

I have strings of the form "KeyOperatorValue1,Value2,Value3...", for example "version>=5" or "lang=en,fr,es". Currently the possible operators are "=", "!=", ">", ">=", "<" and "<=", but I don't want to be limited to those. Given such a string, how can I split it into a (key, operator, values) triplet?
Since the operators' string representations are not mutually exclusive ("=" is a substring of ">="), I can't use public string[] Split(string[] separator, StringSplitOptions options), and Regex.Split has no overload that takes multiple regular expressions as parameters.

Since you have not specified the format of your input, I have made some assumptions:
the key always consists of alphanumeric characters
the values are always alphanumeric, optionally separated by ,
the key and the values are separated by non-word characters
(?<key>\w+)(?<operand>[^\w,]+)(?<value>[\w,]+)
This treats as the operand any run of characters that are neither , nor word characters ([a-zA-Z\d_]).
You can use this code
var lst=Regex.Matches(input,regex)
.Cast<Match>()
.Select(x=>new{
key=x.Groups["key"].Value,
operand=x.Groups["operand"].Value,
value=x.Groups["value"].Value
});
You can now iterate over lst
foreach(var l in lst)
{
Console.WriteLine(l.key);
Console.WriteLine(l.operand);
Console.WriteLine(l.value);
}

Regex has an "or" operator (the separators will be included in the result, though, because they are wrapped in capture groups):
Regex.Split(sourceString, @"(>=)|(<=)|(!=)|(=)|(>)|(<)");

You don't have to use regular expressions to accomplish that. Simply store the operators in an array, sorted by length with the longest operators first, so that ">=" is tried before "=". Iterate over the operators and get the position of the operator using IndexOf(). Now you can use Substring() to extract the key and the values from your input string.
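The IndexOf/Substring idea could be sketched roughly as follows. The class name OperatorSplitter and its Split method are made up for illustration, and the sketch adds one assumption that the answer does not spell out: when several operators occur in the input, the leftmost one is used.

```csharp
using System;

public static class OperatorSplitter
{
    // Operators from the question, longest first so that ">=" is tried before ">" and "=".
    private static readonly string[] Operators = { ">=", "<=", "!=", "=", ">", "<" };

    // Returns (key, operator, value), or null when no known operator occurs in the input.
    public static Tuple<string, string, string> Split(string input)
    {
        int bestIndex = -1;
        string bestOperator = null;

        foreach (string op in Operators)
        {
            int index = input.IndexOf(op, StringComparison.Ordinal);
            // Assumption: keep the leftmost hit. On a tie the longer operator
            // wins, because it was tried first and the comparison is strict.
            if (index >= 0 && (bestIndex < 0 || index < bestIndex))
            {
                bestIndex = index;
                bestOperator = op;
            }
        }

        if (bestOperator == null)
        {
            return null;
        }

        string key = input.Substring(0, bestIndex);
        string value = input.Substring(bestIndex + bestOperator.Length);
        return Tuple.Create(key, bestOperator, value);
    }
}
```

Supporting a new operator only requires adding it to the array, as long as the longest-first order is kept.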

You can just use branching to provide multiple alternatives. There are multiple possibilities to achieve this, one example would be this:
(\w+)([!<>]?=|[<>])(.*)
As you can see this expression contains three separate capture groups:
(\w+): This will match "word" characters (alphanumerics and underscores), as long as the sequence is at least one character long (+).
([!<>]?=|[<>]): This expression matches the operators given in your example. The first half ([!<>]?=) will match any of the characters inside [] (or skip it (?)) followed by =. The alternative simply matches < or >.
(.*): This will match any character (or nothing), whatever follows till the end of the string/line.
So when you match the expression, you'll get a total of 4 groups: group 0 holds the entire match, followed by three captured sub-matches:
1: The name of the key.
2: The operator used.
3: The actual value given.
Edit:
If you'd like to match other operators as well, you'd have to add them as additional branches in the second matching group:
(\w+)([!<>]?=|[<>]|HERE)(.*)
Just keep in mind that there is in general no 100% perfect way to match any operator without defining the exact characters that should be considered valid operators (or components of an operator).
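A minimal C# sketch of reading the three capture groups back from that expression. The class name TripletParser and its Parse method are hypothetical; the pattern is the one given above, unchanged.

```csharp
using System.Text.RegularExpressions;

public static class TripletParser
{
    // Pattern from the answer: key, operator, and everything after the operator.
    private static readonly Regex Pattern = new Regex(@"(\w+)([!<>]?=|[<>])(.*)");

    // Returns { key, operator, value }, or null when the input does not match.
    public static string[] Parse(string input)
    {
        Match m = Pattern.Match(input);
        if (!m.Success)
        {
            return null;
        }

        // Groups[0] is the whole match; Groups[1..3] are the capture groups.
        return new[] { m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value };
    }
}
```

The value part still holds the comma-separated list as one string, so a follow-up Split(',') would be needed to get the individual values.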

Related

Determining if a String begins with an Alphabetical Char and Contains Numerical Digits

I am making an infix evaluator in which the legal tokens include: +, -, *, /, (, ), non-negative integers, and any String that begins with one or more letters and ends with one or more digits.
I am trying to find the most efficient way to determine if a given String begins with one or more letters and ends with one or more digits. The catch is that the alphabetical characters must come before the numerical values (e.g. X1, XX1, X11). However, if the String contains something like 1X, X1X, or X#1, then the input is invalid. I know this covers many possibilities, and I hope there is a way to simplify it.
Thus far I have researched methods such as the String's Any, StartsWith, and EndsWith functions. I just feel like there are too many possibilities to simplify this into a short lambda expression or one-liner. In fact, since we aren't necessarily guaranteed any kind of input, it seems that all N characters would have to be checked in order to ensure that these conditions are met.
Below is the code that I have thus far. This code includes breaking the input String up based on the regular expression @"([()+*/-])"
public static string[] parseString(String infixExp)
{
/* In a legal expression, the only possible tokens are (, ),
* +, -, *, /, non-negative integers, and strings that begin
* with one or more letters and end with one or more digits.
*/
// Ignore all whitespace within the expression.
infixExp = Regex.Replace(infixExp, @"\s+", String.Empty);
// Separate the expression based on the tokens (, ), +, -,
// *, /, and ignore any of the empty Strings that are added
// due to duplicates.
string[] substrings = Regex.Split(infixExp, @"([()+*/-])").Where(s => s != String.Empty).ToArray();
// Return the resulting substrings array such that it
// can be processed by the Evaluate function.
return substrings;
}
If you have any suggestions or references that could help me solve this, please share them!
You can try
char[] letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz".ToCharArray();
char[] numbers = "1234567890".ToCharArray();
foreach(string s in infixExp.Split(' '))
{
if(letters.Contains(s[0]))
{
//do stuff
}
if(numbers.Contains(s[s.Length-1]))
{
//do more stuff
}
}
To determine if a given string begins with one or more letters and ends with one or more digits, you can use the static IsMatch function from the Regex class.
Regex.IsMatch:
Indicates whether the regular expression finds a match in the input string.
Then you can supply the regular expression ^[A-Z]+[0-9]+$ as the second parameter to the method. Likewise, if you wish to ignore case, you can supply RegexOptions.IgnoreCase as the third parameter.
This will ensure that the input string begins with at least one letter, followed by at least one digit and nothing else (e.g. XX1, X3, XZ9).
More info on the Regular Expression Language can be found via this link. I hope this helps anyone in the future!
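As a small sketch of that check (the class name TokenValidator and the method name IsVariable are invented for illustration):

```csharp
using System.Text.RegularExpressions;

public static class TokenValidator
{
    // True when the token is one or more letters followed by one or more
    // digits and nothing else. IgnoreCase lets [A-Z] cover lowercase too.
    public static bool IsVariable(string token)
    {
        return Regex.IsMatch(token, "^[A-Z]+[0-9]+$", RegexOptions.IgnoreCase);
    }
}
```

With the examples from the question, X1, XX1 and X11 are accepted, while 1X, X1X and X#1 are rejected.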

Using Regex.Split to remove anything non numeric and splitting on -

I'm not sure why, but for some reason the Regex.Split method is going over my head. I've looked through tutorials for what I need and can't seem to find anything.
I am simply reading an Excel document and want to format a string such as $145,000-$179,999 to give me two strings: 145000 and 179999. At the same time, I'd like to prune a string such as '$180,000-Limit to simply 180000.
var loanLimits = Regex.Matches(Result.Rows[row + 2 + i][column].ToString(), @"\d+");
The above code seems to chop '$145,000-$179,999 up into 4 parts: 145, 000, 179, 999. Any ideas on how to achieve what I'm asking?
Regular expressions match exactly character by character (there's no knowledge of the concept of a "number" or a "word" in regular expressions - you have to define that yourself in your expression). The expression you are using, \d+, uses the character class \d, which means any digit 0-9 (and + means match one or more). So in the expression $145,000, notice that the part you are looking for is not just composed of digits; it also includes commas. So the regular expression finds every continuous group of characters that matches your regular expression, which are the four groups of numbers.
There are a couple of ways to approach the problem.
Include , in your regular expression, so (\d|,)+, which means match as many consecutive characters as possible that are either a digit or a comma. There will be two matches, 145,000 and 179,999, from which you can then remove the commas with myStr.Replace(",", "").
Do as you say in the title, and remove all non-numeric characters. So you could use Regex.Replace with the expression [^\d-]+, which means match anything that is not a digit or a hyphen, and then replace those with "". Then the result would be 145000-179999, which you can split with a simple non-regular-expression split, myStr.Split('-'), to get your two parts.
Note that for your second example ($180,000-Limit), you'll need an extra check on the number of results returned from Regex.Matches in the first approach, and from Split in the second approach, to determine whether there were two numbers in the range or only a single number.
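Both approaches could look roughly like this in C#. The class and method names are hypothetical. Dropping empty entries after the split is an assumption added here, so that an open-ended range such as $180,000-Limit yields a single number.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LoanLimitParser
{
    // Approach 1: match runs of digits and commas, then drop the commas.
    public static string[] ByMatching(string input)
    {
        return Regex.Matches(input, @"(\d|,)+")
                    .Cast<Match>()
                    .Select(m => m.Value.Replace(",", ""))
                    .ToArray();
    }

    // Approach 2: delete everything except digits and hyphens, then split on the hyphen.
    public static string[] ByStripping(string input)
    {
        string cleaned = Regex.Replace(input, @"[^\d-]+", "");
        // Assumption: empty parts (for example after "180000-") are discarded.
        return cleaned.Split(new[] { '-' }, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

The length of the returned array then tells whether the cell held a range (two entries) or a single limit (one entry).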
You can try to treat each part separately by splitting the string on - and extracting only the digits from each part
ArrayList mystrings = new ArrayList();
List<string> myList = Result.Rows[row + 2 + i][column].ToString().Split('-').ToList();
foreach(var item in myList)
{
string result = Regex.Replace(item, @"[^\d]", "");
mystrings.Add(result);
}
An alternative to using Regex is to use the built-in string and char methods in the .NET Framework. Assuming the input string always contains a single hyphen:
string input = "$145,000-$179,999";
var split = input.Split( '-' )
.Select( x => string.Join( "", x.Where( char.IsLetterOrDigit ) ) )
.ToList();
string first = split.First(); //145000
string second = split.Last(); //179999
First, you split the string using the standard Split method.
Then you keep only the letters or digits from each item in the collection: x.Where...
Then you join those characters back into a string using the standard Join method.
Finally, take the first and last item in the collection for your 2 strings. Note that char.IsLetterOrDigit also keeps letters, so use char.IsDigit instead if the input may contain text such as Limit.

C# regex match behaviour

I've got this line in my code:
Match match = Regex.Match(actualValue, regexValue, RegexOptions.None);
I've got a simple question: why, when checking for success with the line
if(match.Success)
then the match does succeed with the following values:
actualValue = "G:1"
regexValue = "A*"
The actual value does not seem to fit the pattern, at least to me, so I am probably missing something.
What I want to achieve is to receive an actual value and a regular expression, and check whether the actual value fits the regular expression. I thought that's what I did there, but apparently I didn't.
EDIT: Another question: is there a way to treat the * as an "any characters" wildcard? That is, can A* be interpreted as A followed by any characters?
Your code itself is correct; your regular expression isn't.
Based on your comments on other answers, you're after a regular expression which matches any string which starts with A, and you're assuming that '*' means "any characters". '*' in fact means "match the preceding element zero or more times", so the regular expression you've given means "match zero or more 'A' characters". Because zero occurrences are allowed, it finds an empty match at the start of any string and therefore succeeds on absolutely anything.
If you're looking for a regular expression that matches the whole string but only if it starts with 'A', the regular expression you're after is ^A.*. The '.' character in a regular expression means "match any character" (by default, any character except a newline). This regular expression thus means "match the start of the string, followed by an 'A', followed by zero or more other characters" and will thus match the entire string provided it starts with 'A'.
However, you already have the whole string, so this is a little unnecessary - all you really want to do is get an answer to the question "does the string start with an 'A'?". A regular expression that will achieve this is simply '^A'. If it matches, the string started with an 'A'.
Of course, it should be pointed out that you don't need a regular expression to confirm this anyway. If this is genuinely all you want to do (and it's possible you've just put together a simple example, and your real scenario is more complicated), why not just use the StartsWith method?:
bool match = actualValue.StartsWith("A");
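To make the behaviour concrete, here is a short sketch. StarDemo and both method names are made up for illustration. The second method is one possible answer to the wildcard question in the EDIT: it assumes the pattern is a plain wildcard string in which * is the only special character.

```csharp
using System.Text.RegularExpressions;

public static class StarDemo
{
    // "A*" allows zero occurrences of 'A', so the match succeeds on any
    // input; for "G:1" the match is the empty string at index 0.
    public static Match MatchStar(string actualValue)
    {
        return Regex.Match(actualValue, "A*");
    }

    // Hypothetical helper: treats '*' in a wildcard pattern as "any run of
    // characters" by escaping the pattern and turning the escaped \* into .*
    public static bool WildcardIsMatch(string actualValue, string wildcard)
    {
        string pattern = "^" + Regex.Escape(wildcard).Replace(@"\*", ".*") + "$";
        return Regex.IsMatch(actualValue, pattern);
    }
}
```

With this helper the wildcard A* becomes the regular expression ^A.*$, which rejects G:1 and accepts Apple.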
The regex matches because A* means "look for 0 or more occurrences of 'A'". It will match any string.
If you meant to look for an arbitrary number of 'A', but at least one, try A+ instead.
Looking at the comments it looks like you're trying to match a lot of strings starting with A.
If they're separated by white space you could find all of them using the following:
bool matched = Regex.IsMatch(actualValue, @"\bA\w+");
This returns true for "Atest flkjs Apple Ascii cAse", where the pattern matches Atest, Apple and Ascii.
If there is only one string you're matching and it starts with A and has no spaces:
bool matched = Regex.IsMatch(actualValue, @"^A\w+$");
This matches "Apple", but not "Apple and orange" as the second string has spaces.
As Chris noted * is not a wildcard in the way you meant with regex searches. You can find some information to get you started with regexes at regex-info.
The Regex class takes the regular expression in its constructor.
An example in your case could be:
if(new Regex("A*").IsMatch(actualValue))
//Do something
If you are unsure of the regex pattern, try it out here

how to create regular expression based on some condition

I want to create a regular expression to find and replace uppercase characters based on some conditions.
Find the start of each group of uppercase characters in a string, replace the group with lowercase, and insert * before the first character of the group.
If a lowercase character follows an uppercase character, replace that uppercase character with lowercase and insert * before it.
input string : stackOVERFlow
expected output : stack*over*flow
I tried but could not get it working perfectly.
Any idea on how to create such a regular expression?
Thanks
Well the expected inputs and outputs are slightly illogical: you're lower-casing the "f" in "flow" but not including it in the asterisk.
Anyway, the regex you want is pretty simple: @"[A-Z]+". This matches a run of one or more uppercase alpha characters. The quantifier has to be greedy: a lazy [A-Z]+? would match each uppercase letter on its own and wrap every single letter in asterisks.
Now, to do the find/replace, you would do something like the following:
Regex.Replace(inputString, @"([A-Z]+)", "*$1*").ToLower();
This simply finds all occurrences of one or more uppercase alpha characters, and wherever it finds a match it replaces it with itself surrounded by asterisks. This does the surrounding but not the lowercasing; .NET Regex doesn't provide for that kind of string modification. However, since the end result of the operation should be a string with all lowercase chars, just do exactly that with a ToLower() and you'll get the expected result.
KeithS's solution can be simplified a bit
Regex.Replace("stackOVERFlow","[A-Z]+","*$0*").ToLower()
However, this will yield stack*overf*low including the f between the stars. If you want to exclude the last upper case letter, use the following expression
Regex.Replace("stackOVERFlow","[A-Z]+(?=[A-Z])","*$0*").ToLower()
It will yield stack*over*flow
This uses the pattern find(?=suffix), a positive lookahead, which matches find only where it is immediately followed by suffix, without including suffix in the match.
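The difference between the plain greedy pattern and the lookahead variant can be checked with a small sketch. The class and method names are hypothetical, and ToLowerInvariant is used here instead of ToLower to keep the result independent of the current culture.

```csharp
using System.Text.RegularExpressions;

public static class UppercaseMarker
{
    // Greedy run of uppercase letters: the trailing F is included in the match.
    public static string MarkGreedy(string input)
    {
        return Regex.Replace(input, "[A-Z]+", "*$0*").ToLowerInvariant();
    }

    // The lookahead requires another uppercase letter after the match, so the
    // last uppercase letter of the run is left outside the asterisks.
    public static string MarkWithLookahead(string input)
    {
        return Regex.Replace(input, "[A-Z]+(?=[A-Z])", "*$0*").ToLowerInvariant();
    }
}
```

For the input stackOVERFlow, MarkGreedy gives stack*overf*low and MarkWithLookahead gives stack*over*flow.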

How can you match words with more than one character?

I would like to use a regular expression to match all words with more than one distinct character, as opposed to words made entirely of the same character.
This should not match: ttttt, rrrrr, ggggggggggggg
This should match: rttttttt, word, wwwwwwwwwu
The following expression will do the trick.
^(?<FIRST>[a-zA-Z])[a-zA-Z]*?(?!\k<FIRST>)[a-zA-Z]+$
capture the first character into the group FIRST
capture some more characters (lazily to avoid backtracking)
ensure that the next character is different from FIRST using a negative lookahead assertion
capture all remaining characters (at least one, namely the character checked by the assertion)
Note that it is sufficient to look for a character that is different from the first one, because if no character is different from the first one, all characters are equal.
You can shorten the expression to the following.
^(\w)\w*?(?!\1)\w+$
This will match some more characters other than [a-zA-Z].
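A quick way to try the named-group expression against the examples from the question (DistinctCharCheck and its method are illustrative names only):

```csharp
using System.Text.RegularExpressions;

public static class DistinctCharCheck
{
    // The first expression from the answer, unchanged.
    private static readonly Regex Pattern =
        new Regex(@"^(?<FIRST>[a-zA-Z])[a-zA-Z]*?(?!\k<FIRST>)[a-zA-Z]+$");

    // True when the word contains at least one character that differs from its first one.
    public static bool HasMoreThanOneCharacter(string word)
    {
        return Pattern.IsMatch(word);
    }
}
```

It rejects ttttt, rrrrr and ggggggggggggg, and accepts rttttttt, word and wwwwwwwwwu.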
I would add all unique words to a list and then use this regex
\b(\w)\1+\b
to grab all the words made of a single repeated character and get rid of them
This doesn't use a regular expression, but I believe it will do what you require:
public bool Match(string str)
{
return string.IsNullOrEmpty(str)
|| str.ToCharArray()
.Skip(1)
.Any( c => !c.Equals(str[0]) );
}
The following RE will do the opposite of what you're asking for: match where a word is composed of the same character. It may still be useful to you though.
\b(\w)\1*\b
\b\w*?(\w)\1*(?:(?!\1)\w)\w*\b
or
\b(\w)(?!\1*\b)\w*\b
This assumes you're plucking the words out of some larger text; that's why it needs the word boundaries and the padding. If you have a list of words and you're just trying to validate the ones that meet the criteria, a much simpler regex would probably do:
(.)(?:(?!\1).)
...because you already know each word contains only word characters. On the other hand, depending on your definition of "word" you might need to replace \w in the first two regexes with something more specific, like [A-Za-z].
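For the case of plucking words out of a larger text, the second of the two word-boundary expressions could be used like this. WordExtractor and its method are hypothetical names; the pattern itself is taken from above.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class WordExtractor
{
    // Returns every word in the text that is not made of a single repeated character.
    public static string[] WordsWithDistinctCharacters(string text)
    {
        return Regex.Matches(text, @"\b(\w)(?!\1*\b)\w*\b")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }
}
```

For the text "ttttt word rrrrr rttttttt" this returns word and rttttttt.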
