Regex.Split command in C#

I am trying to use Regex.Split to split a string while keeping all of its contents, including the delimiters I use. The string is a math problem, for example 5+9/2*1-1. I have it working if the string contains a + sign, but I don't know how to add more than one character to the delimiter list. I have looked at multiple pages online, but everything I try gives me errors. Here is the Regex.Split line I have (it works for the plus; now I need it to also handle -, *, and /):
string[] everything = Regex.Split(inputBox.Text, @"(\+)");

Use a character class to match any of the math operations: [*/+-]
string input = "5+9/2*1-1";
string pattern = @"([*/+-])";
string[] result = Regex.Split(input, pattern);
Be aware that character classes allow ranges, such as [0-9], which matches any digit from 0 up to 9. Therefore, to avoid accidental ranges, you can escape the - or place it at either the beginning or end of the character class.
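As a quick check of what the split produces, here is a minimal sketch written as a small top-level program (an assumption; the original code lives in a form handler that reads inputBox.Text). The capturing group is what keeps the operators in the result.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "5+9/2*1-1";

// The parentheses form a capturing group, so each matched operator is
// returned as its own element. The hyphen is placed last in the class,
// which makes it a literal character rather than a range.
string[] tokens = Regex.Split(input, @"([*/+-])");

Console.WriteLine(string.Join(" ", tokens)); // 5 + 9 / 2 * 1 - 1
```

If the input starts with an operator, as in -5+3, the first element of the array is an empty string, because Regex.Split reports the empty text before the first match.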

Related

Using Regex.Split to remove anything non numeric and splitting on -

I'm not sure why, but for some reason the Regex.Split method is going over my head. I've looked through tutorials for what I need and can't seem to find anything.
I am simply reading an Excel document and want to format a string such as $145,000-$179,999 to give me two strings: 145000 and 179999. At the same time I'd like to prune a string such as '$180,000-Limit to simply 180000.
var loanLimits = Regex.Matches(Result.Rows[row + 2 + i][column].ToString(), @"\d+");
The above code seems to chop '$145,000-$179,999 up into 4 parts: 145, 000, 179, 999. Any ideas on how to achieve what I'm asking?
Regular expressions match exactly character by character (there's no knowledge of the concept of a "number" or a "word" in regular expressions - you have to define that yourself in your expression). The expression you are using, \d+, uses the character class \d, which means any digit 0-9 (and + means match one or more). So in the expression $145,000, notice that the part you are looking for is not just composed of digits; it also includes commas. So the regular expression finds every continuous group of characters that matches your regular expression, which are the four groups of numbers.
There are a couple of ways to approach the problem.
Include , in your regular expression, so (\d|,)+, which means match as many characters in a row as possible that are either a digit or a comma. There will be two matches, 145,000 and 179,999, from which you can then remove the commas with myStr.Replace(",", "").
Do as you say in the title, and remove all non-numeric characters. You could use Regex.Replace with the expression [^\d-]+ (which matches anything that is not a digit or a hyphen) and replace those matches with "". The result would be 145000-179999, which you can split with a simple non-regex split, myStr.Split('-'), to get your two parts.
Note that for your second example ($180,000-Limit), you'll need an extra check on the number of results returned (from Matches in the first approach, or from Split in the second) to determine whether the range contained two numbers or only one.
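The remove-then-split approach, together with the count check, could be sketched as follows. ParseBounds is a hypothetical helper name, and the two sample cells are taken from the question; reading from the Excel rows is left out.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the numbers found in a cell, in order.
static string[] ParseBounds(string cell)
{
    // Drop everything except digits and hyphens, then split on the hyphen.
    string cleaned = Regex.Replace(cell, @"[^\d-]+", "");
    // RemoveEmptyEntries discards the empty piece left behind by "180000-".
    return cleaned.Split(new[] { '-' }, StringSplitOptions.RemoveEmptyEntries);
}

string[] range = ParseBounds("$145,000-$179,999");
string[] single = ParseBounds("$180,000-Limit");

// The length of the array tells a two-number range from a single limit.
Console.WriteLine(range.Length == 2 ? range[0] + " to " + range[1] : range[0]);
Console.WriteLine(single.Length == 2 ? single[0] + " to " + single[1] : single[0]);
```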
You can try treating each part separately by splitting the string on - and extracting only the digits from each part:
ArrayList mystrings = new ArrayList();
List<string> myList = Result.Rows[row + 2 + i][column].ToString().Split('-').ToList();
foreach (var item in myList)
{
    // Remove every non-digit character from this half of the range.
    string result = Regex.Replace(item, @"[^\d]", "");
    mystrings.Add(result);
}
An alternative to using Regex is to use the built-in string and char methods in the .NET Framework. Assuming the input string will always have a single hyphen:
string input = "$145,000-$179,999";
var split = input.Split('-')
    .Select(x => string.Join("", x.Where(char.IsLetterOrDigit)))
    .ToList();
string first = split.First(); // 145000
string second = split.Last(); // 179999
First, split the string using the standard Split method.
Then, for each item in the collection, keep only its letters or digits: x.Where... (use char.IsDigit instead if letters such as those in Limit should be dropped as well).
Then join those characters back into a string using the standard Join method.
Finally, take the first and last items in the collection as your two strings.

Split String using delimiter that exists in the string

I have a problem and I am wondering if there is any smart workaround.
I need to pass a string through a socket to a web application. This string has three parts and I use the '|' as a delimiter to split at the receiving application into the three separate parts.
The problem is that the '|' character can be a character in any of the 3 separate strings and when this occurs the whole splitting action distorts the strings.
My question therefore is this:
Is there a way to use a char/string as a delimiter in some text while this char/string itself might be in the text?
The general pattern is to escape the delimiter character. E.g. when '|' is the delimiter, you could use "||" whenever you need the character itself inside a string (might be difficult if you allow empty strings) or you could use something like '\' as the escape character so that '|' becomes "\|" and "\" itself would be "\\"
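A minimal sketch of the backslash variant, with hypothetical helper names Escape and SplitEscaped: the sender escapes the backslash first and then the pipe, and the receiver walks the message once, treating any character after a backslash as literal.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Sender side: escape the escape character first, then the delimiter.
static string Escape(string part)
{
    return part.Replace(@"\", @"\\").Replace("|", @"\|");
}

// Receiver side: an unescaped '|' ends a part; "\x" yields a literal x.
static List<string> SplitEscaped(string message)
{
    var parts = new List<string>();
    var current = new StringBuilder();
    for (int i = 0; i < message.Length; i++)
    {
        char c = message[i];
        if (c == '\\' && i + 1 < message.Length)
        {
            current.Append(message[++i]); // escaped character, taken literally
        }
        else if (c == '|')
        {
            parts.Add(current.ToString());
            current.Clear();
        }
        else
        {
            current.Append(c);
        }
    }
    parts.Add(current.ToString()); // last part has no trailing delimiter
    return parts;
}

string[] original = { "a|b", @"c\d", "" };
string wire = string.Join("|", original.Select(p => Escape(p)));
List<string> decoded = SplitEscaped(wire);

Console.WriteLine(wire);
Console.WriteLine(decoded.SequenceEqual(original)); // True
```

Unlike doubling the delimiter, this scheme keeps empty parts unambiguous, as the third sample part shows.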
The matter here is that given the following string:
string toParse = "What|do you|want|to|say|?";
It can be parsed in several ways:
"What
do you
want|to|say|?"
or
"What|do you
want
to|say|?"
and so on...
You can define rules to parse your string, but coding them will be hard, and the result will seem counterintuitive to the end user.
The string must contain an escape sequence indicating that the "|" symbol is meant literally, not as the separator.
This could be for example "\|".
Here is a full example using regex:
using System.Text.RegularExpressions;
//... Put this in the main method of a Console Application for instance.
// The '@' prefix marks a verbatim string, in which the backslash '\' is not treated as an escape character
Regex reg = new Regex(@"^((?<string1>([^\|]|\\\|)+)\|)((?<string2>([^\|]|\\\|)+)\|)(?<string3>([^\|]|\\\|)+)$");
string toTest = @"user\|dureuill|deserves|an\|upvote";
MatchCollection matches = reg.Matches(toTest);
if (matches.Count != 1)
{
throw new FormatException("Bad formatted pattern.");
}
Match match = matches[0];
string string1 = match.Groups["string1"].Value.Replace(@"\|", "|");
string string2 = match.Groups["string2"].Value.Replace(@"\|", "|");
string string3 = match.Groups["string3"].Value.Replace(@"\|", "|");
Console.WriteLine(string1);
Console.WriteLine(string2);
Console.WriteLine(string3);
Console.ReadKey();
Is there a way to use a char/string as a delimiter in some text while
this char/string itself might be in the text?
Simple answer: No.
This is of course when the string/delimiter is exactly the same, without doing modifications to the text.
There are of course possible workarounds. One possible solution is that you might want to have a minimum/fixed width between delimiters, this is not perfect however.
Another possible solution is to select a delimiter (sequence of characters) that will never occur together in your text. This requires you to change the source and consumer.
When I need to use delimiters I normally select a delimiter that I am 99.9% sure will never occur in normal text, the delimiter may vary depending on what kind of text that I expect.
Here's a quote from Wikipedia:
Because delimiter collision is a very common problem, various methods
for avoiding it have been invented. Some authors may attempt to avoid
the problem by choosing a delimiter character (or sequence of
characters) that is not likely to appear in the data stream itself.
This ad-hoc approach may be suitable, but it necessarily depends on a
correct guess of what will appear in the data stream, and offers no
security against malicious collisions. Other, more formal conventions
are therefore applied as well.
Just a side note to your use-case, why not use a protocol for the data that is sent? Such as protobuf?
Maybe it is useful to HTMLEncode and HTMLDecode your strings first and then attach them together with your delimiter.
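One caveat: HTML encoding generally leaves the | character unchanged, so on its own it would not remove the collision. A sketch of the same encode-then-join idea using percent-encoding instead (a swapped-in technique, not the one named above) could look like this. Uri.EscapeDataString turns | into %7C and % into %25, so the encoded parts never contain a raw pipe.

```csharp
using System;
using System.Linq;

string[] parts = { "left|side", "50% off", "a & b" };

// Encode each part, then join with the raw delimiter.
string wire = string.Join("|", parts.Select(p => Uri.EscapeDataString(p)));

// Receiver: split on the raw delimiter, then decode each part.
string[] decoded = wire.Split('|')
    .Select(p => Uri.UnescapeDataString(p))
    .ToArray();

Console.WriteLine(wire);
Console.WriteLine(decoded.SequenceEqual(parts)); // True
```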
I think you either
1)Find a character or set of characters together that would never appear in the string
or
2)Use fixed length strings and pad.
Maybe adapt the delimiter if you have the flexibility to do this? So instead of String1|String2 the string could read "String1"|"String2".
If pipes are unwanted - put some simple validation in place during creation/entry of this string?
Instead of using | as delimiter, you could find a delimiter that's not present in the message parts and pass it along at the beginning of the sent message. Here's an example using an integer as delimiter:
String[] parts = {"this is a message", "it's got three parts", "this one's the last"};
String delimiter = null;
for (int i = 0; i < 100; i++) {
String s = Integer.toString(i);
if (parts[0].contains(s) || parts[1].contains(s) || parts[2].contains(s))
continue;
delimiter = s;
break;
}
String message = delimiter + "#" + parts[0] + delimiter + parts[1] + delimiter + parts[2];
Now the message is 0#this is a message0it's got three parts0this one's the last.
On the receiving end you start by finding the delimiter and split the message string on that:
String[] tmp = message.split("#", 2);
String[] parts = tmp[1].split(tmp[0]);
It's not the most efficient possible solution, since it requires scanning the message parts several times, but it's very easy to implement. If you don't find a value for delimiter and null happens to be part of the message, you might experience unexpected results.

Matching a substring of any length and characters using RegEx

I would like to be able to match and then extract all substrings in the following string using regex in c#:
"2012-05-15 00:49:02 192.168.100.10 POST /Microsoft-Server-ActiveSync/default.eas User=nikced&DeviceId=ApplDNWGRKZQDTC0&DeviceType=iPhone&Cmd=Ping&Log=V121_Sst8_LdapC0_LdapL0_RpcC31_RpcL50_Hb3540_Erq1_Pk1728465481_S2_ 443 redcloud\nikced 94.234.170.42 Apple-iPhone4C1/902.179 200 0 64 3140491"
Since it's a log file, the regex should be able to handle any line of a similar type.
In this case, the preferred output to a collection should be:
2012-05-15
00:49:02
192.168.100.10
/Microsoft-Server-ActiveSync/default.eas
User=nikced&DeviceId=ApplDNWGRKZQDTC0&DeviceType=iPhone&Cmd=Ping&Log=V121_Sst8_LdapC0_LdapL0_RpcC31_RpcL50_Hb3540_Erq1_Pk1728465481_S2_
443
redcloud\nikced
94.234.170.42
Apple-iPhone4C1/902.179
200
0
64
3140491
I'd appreciate any answer using C#, .NET and Regex to extract the above substrings into a collection (MatchCollection preferred). All log lines follow the same format and pattern.
Incredibly complex regex incoming:
logFile.Split(' ');
This will give you an array that you can iterate through to retrieve all of the "lines" which are separated by a space
string[] lines = log.Split(' ');
You don't need to use a Regex. You can simply use String.Split Method, and specify space as separator:
string [] substrings = line.Split(new Char [] {' '});
If you need to identify the kind of each part, then you should specify what you need to find, and a regex can be created for it.
Anyway, if you really want to use a Regex, do this:
Regex re = new Regex(@"(?:(?<s>[^ ]+)(?: |$))*");
This will give you all the captures in the "s" group, when you call the Match method.
As the OP pointed out in a comment, the separator can be something other than a single space, so the possible separators should be included in the (?: |$) and the [^ ] parts of the expression. For example, if both space and tab are possible separators, replace those parts with (?: |\t|$) and [^ \t]. If you need to accept more than one of those characters in a row as a separator, add a + after the () group:
(?:(?<s>[^ \t]+)(?: |\t|$)+)*
The fastest and most obvious way is to use String.Split:
string[] substrings = line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
But if you insist on a MatchCollection then this will do what you want:
MatchCollection substrings = Regex.Matches(line, @"\S+");
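To see the \S+ approach end to end, here is a minimal sketch. The log line is a shortened, made-up sample in the same space-separated layout as the one in the question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Shortened sample line (an assumption, not the full line from the question).
string line = @"2012-05-15 00:49:02 192.168.100.10 POST /default.eas Cmd=Ping 443 redcloud\nikced 200";

// \S+ matches each maximal run of non-whitespace characters.
MatchCollection fields = Regex.Matches(line, @"\S+");

// Copy the match values into an array for easy indexing.
string[] values = fields.Cast<Match>().Select(m => m.Value).ToArray();

Console.WriteLine(values.Length); // 9
Console.WriteLine(values[3]);     // POST
```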
Really, you just need to break this down into the parts.
First, the date. Will it always be in YYYY-MM-DD format? Could it be possible that it will be different based on region/culture settings?
(?<LogDate>\d\d\d\d-\d\d-\d\d)
Next, you have the time. Same thing:
(?<LogTime>\d\d:\d\d:\d\d)
Next, I'm assuming this is the web method that was actually called? Not entirely sure, since you haven't really explained how the data is laid out. However, I'm assuming it's going to be either POST or GET, so that's what we're going to do next...
(?<LogMethod>POST|GET)
Just do this for every part of the log line you're interested in, and you'll be set. I.e.:
(?<LogDate>\d\d\d\d-\d\d-\d\d) (?<LogTime>\d\d:\d\d:\d\d) (?<LogMethod>POST|GET)...
If you want to anchor to the start/end of the line, be sure to use ^ and $ respectively. When you get the Matches, you can get the values from each group by indexing the Groups property with the named group (such as match.Groups["LogMethod"].Value). Good luck!
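Put together, a sketch along these lines might look as follows. The sample line is shortened, and the LogIp group is an addition for illustration, since in the question's log line the IP address sits between the time and the method.

```csharp
using System;
using System.Text.RegularExpressions;

// Shortened sample line (an assumption, not the full line from the question).
string line = "2012-05-15 00:49:02 192.168.100.10 POST /default.eas";

// One named group per field, anchored to the start of the line.
var pattern = new Regex(
    @"^(?<LogDate>\d{4}-\d{2}-\d{2}) (?<LogTime>\d{2}:\d{2}:\d{2}) (?<LogIp>\S+) (?<LogMethod>POST|GET) ");

Match match = pattern.Match(line);
string logDate = match.Groups["LogDate"].Value;
string logTime = match.Groups["LogTime"].Value;
string logMethod = match.Groups["LogMethod"].Value;

Console.WriteLine(match.Success); // True
Console.WriteLine(logDate + " | " + logTime + " | " + logMethod);
```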

Regex Word splitting in C#

I know similar questions have been asked before, but I can't find one that is like mine, or enough like mine to help me out :). So essentially I want to split up a string which contains a bunch of words, and I don't want to return any characters that are not words (this is the key problem I am struggling with, ignoring characters). This is how I define the problem:
What constitutes a word is a string of any character a-zA-Z only
(no numbers or anything else)
In between any word, there can be any number of random other characters
I want to get back a string[] containing only the words
eg: text: "apple^&**^orange1247pear"
I want to return: apple, orange, pear in an array.
The closest I have found I suppose is this:
Regex.Split("apple^orange7pear",@"([a-zA-Z]*)")
Which splits out the apple/orange/pear, but also returns a bunch of other junk and blank strings.
Anyone know how to stop the split function from returning certain parts of the string, or is that not possible?
Thanks in advance for any help you give me :)
Split should match the tokens between your words. In your regex you've added a group around the word, so it is included in the result, but that isn't desired in this case. Note that this regex matches anything besides valid words - anything that isn't an ASCII letter:
string[] words = Regex.Split(str, "[^a-zA-Z]+");
Another option is to match the words directly:
MatchCollection matches = Regex.Matches(str, "[a-zA-Z]+");
string[] words2 = matches.Cast<Match>().Select(m => m.Value).ToArray();
The second option is probably clearer, and will not include blank elements on the start or end of the array.
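The difference shows up when the input begins (or ends) with non-letter characters. A minimal sketch, using a variation of the question's sample text with digits added at the front:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string text = "1247apple^&**^orange1247pear";

// Splitting on the separators: the leading "1247" is itself a separator,
// so the text before it (an empty string) becomes the first element.
string[] viaSplit = Regex.Split(text, "[^a-zA-Z]+");

// Matching the words directly never yields empty elements.
string[] viaMatches = Regex.Matches(text, "[a-zA-Z]+")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();

Console.WriteLine(viaSplit.Length);   // 4
Console.WriteLine(viaMatches.Length); // 3
```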
var splits = Regex.Split("aaa $$$bbb ccc", @"[^A-Za-z]+");
But to include non-latin letters, I would use this:
var splits = Regex.Split("aaa $$$bbb ccc", @"\P{L}+");
Try this:
Regex.Matches("kalle kula(/()&//()nisse8978971", @"[A-Za-z]+")
Using Matches() will collect only the words, Split() will divide the string which is not what you want.
The second option Kobi listed is better and easier to control. I use the following regular expression to locate common entities such as words, numbers, and email addresses in a string:
var regex = new Regex(@"[\p{L}\p{N}\p{M}]+(?:[-.'´_@][\p{L}|\p{N}|\p{M}]+)*", RegexOptions.Compiled);

C# reliable way to pattern match?

At the moment I am trying to match patterns such as
text text date1 date2
So I have regular expressions that do just that. However, the issue is for example if users input data with say more than 1 whitespace or if they put some of the text in a new line etc the pattern does not get picked up because it doesn't exactly match the pattern set.
Is there a more reliable way for pattern matching? The goal is to make it very simple for the user to write but make it easily matchable on my end. I was considering stripping out all the whitespace/newlines etc and then trying to match the pattern with no spaces i.e. texttextdate1date2.
Anyone got any better solutions?
Update
Here is a small example of the pattern I would need to match:
FIND me@test.com 01/01/2010 to 10/01/2010
Here is my current regex:
FIND [A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4} [0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4} to [0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}
This works fine 90% of the time. However, if users submit this information via email, it can have all different kinds of formatting and HTML that I am not interested in. I am using a combination of the HtmlAgilityPack and an HTML tag removing regex to strip all the HTML from the email, but even then I can't seem to get a match on some occasions.
I believe this could be a more parsing related question than pattern matching, but I think maybe there is a better way of doing this...
To match at least one or more whitespace characters (space, tab, newline), use:
\s+
Substitute the above wherever you have the physical space in your pattern and you should be fine.
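Applied to the pattern from the question, that substitution might look like the sketch below. The capturing groups and RegexOptions.IgnoreCase are additions for illustration (the question's character classes only list upper-case letters), and the input string is a made-up example with extra spaces and a line break.

```csharp
using System;
using System.Text.RegularExpressions;

// Input with a double space and a line break inside the command.
string input = "FIND  me@test.com 01/01/2010\nto   10/01/2010";

// Every literal space from the question's pattern is replaced by \s+.
var pattern = new Regex(
    @"FIND\s+([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})\s+" +
    @"([0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4})\s+to\s+([0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4})",
    RegexOptions.IgnoreCase);

Match m = pattern.Match(input);
Console.WriteLine(m.Success);         // True
Console.WriteLine(m.Groups[1].Value); // me@test.com
Console.WriteLine(m.Groups[2].Value); // 01/01/2010
Console.WriteLine(m.Groups[3].Value); // 10/01/2010
```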
Example of matching multiple groups in a text with multiple whitespaces and/or newlines.
var txt = "text text date1\ndate2";
var matches = Regex.Match(txt, @"([a-z]+)\s+([a-z]+)\s+([a-z0-9]+)\s+([a-z0-9]+)", RegexOptions.Singleline);
matches.Groups[n].Value with n from 1 to 4 will contain your matches.
I would split the string into a string array and match each resulting string to the necessary Regular Expression.
\b(text)[\s]+(text)[\s]+(date1)[\s]+(date2)\b
It's a nasty expression, but here is something that will work for the input you provided:
^(\w+)\s+([\w@.]+)\s+(\d{2}\/\d{2}\/\d{4})[^\d]+(\d{2}\/\d{2}\/\d{4})$
This will work with variable amounts of whitespace between the capture groups as well.
Through ORegex you can tokenize your string and just pattern match on token sequences:
var tokens = input.Split(new[]{' ','\t','\n','\r'}, StringSplitOptions.RemoveEmptyEntries);
var oregex = new ORegex<string>("{0}{0}{1}{1}", IsText, IsDate);
var matches = oregex.Matches(tokens); //here is your subsequence tokens.
...
public bool IsText(string str)
{
...
}
public bool IsDate(string str)
{
...
}
