C# reliable way to pattern match?


At the moment I am trying to match patterns such as
text text date1 date2
So I have regular expressions that do just that. However, if users enter more than one whitespace character between fields, or put some of the text on a new line, the pattern does not get picked up because the input no longer matches the pattern exactly.
Is there a more reliable way for pattern matching? The goal is to make it very simple for the user to write but make it easily matchable on my end. I was considering stripping out all the whitespace/newlines etc and then trying to match the pattern with no spaces i.e. texttextdate1date2.
Anyone got any better solutions?
Update
Here is a small example of the pattern I would need to match:
FIND me@test.com 01/01/2010 to 10/01/2010
Here is my current regex:
FIND [A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4} [0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4} to [0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}
This works fine 90% of the time; however, if users submit this information via email it can have all different kinds of formatting and HTML I am not interested in. I am using a combination of the HtmlAgilityPack and an HTML tag-removing regex to strip all the HTML from the email, but even then I can't seem to get a match on some occasions.
I believe this could be a more parsing related question than pattern matching, but I think maybe there is a better way of doing this...

To match one or more whitespace characters (space, tab, newline), use:
\s+
Substitute the above wherever you have the physical space in your pattern and you should be fine.
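As a rough sketch of that substitution applied to the pattern from the question (assuming the address separator is @, that matching should be case-insensitive, and that any HTML has already been stripped; the variable names are illustrative):

```csharp
using System;
using System.Text.RegularExpressions;

// The question's pattern with every literal space replaced by \s+,
// so runs of spaces, tabs and newlines between fields are all accepted.
var pattern = new Regex(
    @"FIND\s+[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\s+" +
    @"[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}\s+to\s+[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}",
    RegexOptions.IgnoreCase);

// Input with extra spaces, a tab and a line break, as an email client might produce.
string input = "FIND   me@test.com\r\n01/01/2010  to\t10/01/2010";

bool found = pattern.IsMatch(input);
Console.WriteLine(found); // True
```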

Example of matching multiple groups in a text with multiple whitespaces and/or newlines.
var txt = "text text date1\ndate2";
var matches = Regex.Match(txt, @"([a-z]+)\s+([a-z]+)\s+([a-z0-9]+)\s+([a-z0-9]+)", RegexOptions.Singleline);
matches.Groups[n].Value with n from 1 to 4 will contain your matches.

I would split the string into a string array and match each resulting string to the necessary Regular Expression.
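One way to read this suggestion, sketched with hypothetical per-field patterns (the token layout FIND / address / date / to / date is taken from the question; the individual patterns and names are assumptions):

```csharp
using System;
using System.Text.RegularExpressions;

string input = "FIND me@test.com\n01/01/2010   to 10/01/2010";

// A null separator splits on any whitespace; empty entries are dropped.
string[] tokens = input.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

// One small anchored pattern per expected token, in order.
string[] fieldPatterns =
{
    @"^FIND$",
    @"^[\w.%+-]+@[\w.-]+\.[A-Za-z]{2,4}$",
    @"^\d{1,2}/\d{1,2}/\d{2,4}$",
    @"^to$",
    @"^\d{1,2}/\d{1,2}/\d{2,4}$",
};

// The input is valid only if the token count and every token match.
bool valid = tokens.Length == fieldPatterns.Length;
for (int i = 0; valid && i < tokens.Length; i++)
{
    valid = Regex.IsMatch(tokens[i], fieldPatterns[i], RegexOptions.IgnoreCase);
}

Console.WriteLine(valid);
```

Because whitespace is discarded before any matching happens, the individual patterns never have to account for it.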

\b(text)[\s]+(text)[\s]+(date1)[\s]+(date2)\b

It's a nasty expression, but here is something that will work for the input you provided:
^(\w+)\s+([\w@.]+)\s+(\d{2}\/\d{2}\/\d{4})[^\d]+(\d{2}\/\d{2}\/\d{4})$
This will work with variable amounts of whitespace between the capture groups as well.

Through ORegex you can tokenize your string and just pattern match on token sequences:
var tokens = input.Split(new[]{' ','\t','\n','\r'}, StringSplitOptions.RemoveEmptyEntries);
var oregex = new ORegex<string>("{0}{0}{1}{1}", IsText, IsDate);
var matches = oregex.Matches(tokens); //here is your subsequence tokens.
...
public bool IsText(string str)
{
...
}
public bool IsDate(string str)
{
...
}

Related

How to structure REGEX in C#

I currently have a regex that checks if a US State is spelled correctly
var r = new Regex(string.Format(@"\b(?:{0})\b", pattern), RegexOptions.IgnoreCase);
pattern is a pipe delimited string containing all US states.
It was working as intended until today, when one of the states was spelled like "Florida..". I would have liked it to pick up on the fact that there was a full stop character.
I found this regex that will only match letters.
^[a-zA-Z]+
How do I combine this with my current Regex or is it not possible?
I tried some variations of this but it didn't work
var r = new Regex(string.Format(@"\b^[a-zA-Z]+(?:{0})\b", pattern), RegexOptions.IgnoreCase);
EDIT: Florida.. was in my input string. My pattern string hasn't changed at all. Apologies for not being clearer.

It seems you need start of string (^) and end of string ($) anchors:
var r = new Regex(string.Format(@"^(?:{0})$", pattern), RegexOptions.IgnoreCase);
The regex above would match any string comprising a name of a state only.

You should escape the regex special characters in the pattern variable. One of them is the . character. Something similar to pattern.Replace(".", @"\.") but covering all the special characters.
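A minimal sketch combining the two suggestions above, assuming a shortened, hypothetical state list. Regex.Escape is the framework method that escapes all regex metacharacters; it is applied to each name separately here so that the | separators keep their meaning:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical subset of the pipe-delimited state list from the question.
string pattern = "Florida|New Mexico|Colorado";

// Escape each name on its own, then rebuild the alternation.
string escaped = string.Join("|", pattern.Split('|').Select(Regex.Escape));

// ^ and $ require the whole input to be exactly one state name.
var r = new Regex($"^(?:{escaped})$", RegexOptions.IgnoreCase);

Console.WriteLine(r.IsMatch("florida"));   // True
Console.WriteLine(r.IsMatch("Florida..")); // False
```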

I believe you can't merge both patterns into one, so you would have to perform two different regex operations: one to split the states into a list, and a subsequent one to validate each item within it.
I'd rather go for something "simpler" such as
var states = input.Split('|').Select(s => new string(s.Where(char.IsLetter).ToArray()))
.Where(s => !string.IsNullOrWhiteSpace(s));

Basically don't use a regex here.
List<string> values = new List<string>() { "florida" /* etc. */ };
string input = "Florida..";
//is input in values, ignore case and look for any value that includes the input value
bool correct = values.Any(a =>
input.IndexOf(a, StringComparison.CurrentCultureIgnoreCase) >= 0);
This will be considerably more efficient than a regex based option. This should match florida, Florida and Florida..., etc.

Don't search for the characters directly; tell the regex to consume everything that is not one of the specific characters you want to exclude, such as [^\|.]+. It uses a set [ ] with the negation indicator ^, which says: consume anything that is not a literal | or a literal . character. Hence it consumes just the text needed. For example, on
Colorado|Florida..|New Mexico
it returns 3 matches: Colorado, Florida and New Mexico.
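The negated character class can be tried out with a short sketch like the following (variable names are illustrative):

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string input = "Colorado|Florida..|New Mexico";

// [^\|.]+ consumes runs of characters that are neither '|' nor '.'.
string[] states = Regex.Matches(input, @"[^\|.]+")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();

Console.WriteLine(string.Join(", ", states)); // Colorado, Florida, New Mexico
```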

Regex replace all matching words that do not contain a certain string

How can I use regex to replace matching strings that do not include a specific string?
input string
Keepword mywordsecond mythirdword myfourthwordKeep
string to replace
word
exclude string
Keep
Desired out put
Keepword mysecond mythird myfourthKeep

Will there ever be more than one word in a word? If there are more than one, do you want to replace all of them? If not, this should sort you out:
Regex r = new Regex(@"\b((?:(?!Keep|word)\w)*)word((?:(?!Keep)\w)*)\b");
s1 = r.Replace(s0, "$1$2");
to explain:
First, \b((?:(?!Keep|word)\w)*) captures whatever text precedes the first occurrence of word or Keep.
The next thing it sees must be word; if it sees Keep or the end of the string instead, the match attempt immediately fails.
Then ((?:(?!Keep)\w)*)\b captures the remainder of the text in order to ensure it doesn't contain Keep.
When faced with a problem like this, most users' first impulse is to match (in the sense of consuming) only the part of the string they're interested in, using lookarounds to establish the context. It's usually much easier to write the regex so that it always moves forward through the string as it matches. You capture the parts you want to retain so you can plug them back into the result string by means of group references ($1, $2, etc.).
Given that you're using C#, you could use the lookaround approach:
Regex r = new Regex(@"(?<!Keep\w*)word(?!\w*Keep)");
s1 = r.Replace(s0, "");
But please don't. There are very few regex flavors that support unrestricted lookbehinds like .NET does, and most problems don't work so neatly as this one anyway.

string str = "Keepword mywordsecond mythirdword myfourthwordKeep";
str = Regex.Replace(str, "(?<!Keep)word", "");

This works in notepad++:
(?<!Keep)word(?!Keep)
It uses a negative lookbehind and a negative lookahead.

You can use a negative look-behind assertion if you want to remove every "word" that is not preceded by "Keep":
String input = "Keepword mywordsecond mythirdword myfourthwordKeep";
String pattern = "(?<!Keep)word";
String output = Regex.Replace(input, pattern, "");
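As a quick check of the look-behind approach against the sample from the question, a self-contained sketch:

```csharp
using System;
using System.Text.RegularExpressions;

string input = "Keepword mywordsecond mythirdword myfourthwordKeep";

// (?<!Keep) is a negative lookbehind: "word" matches only when the
// text immediately before it is not "Keep".
string output = Regex.Replace(input, @"(?<!Keep)word", "");

Console.WriteLine(output); // Keepword mysecond mythird myfourthKeep
```

Only the text before each occurrence is inspected, so the word in myfourthwordKeep is removed while the one in Keepword is kept, which is what the desired output in the question shows.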

Matching a substring of any length and characters using RegEx

I would like to be able to match and then extract all substrings in the following string using regex in c#:
"2012-05-15 00:49:02 192.168.100.10 POST /Microsoft-Server-ActiveSync/default.eas User=nikced&DeviceId=ApplDNWGRKZQDTC0&DeviceType=iPhone&Cmd=Ping&Log=V121_Sst8_LdapC0_LdapL0_RpcC31_RpcL50_Hb3540_Erq1_Pk1728465481_S2_ 443 redcloud\nikced 94.234.170.42 Apple-iPhone4C1/902.179 200 0 64 3140491"
Since it's a logfile, the regex should be able to handle any line of a similar type.
In this case, the preferred output to a collection should be:
2012-05-15
00:49:02
192.168.100.10
/Microsoft-Server-ActiveSync/default.eas
User=nikced&DeviceId=ApplDNWGRKZQDTC0&DeviceType=iPhone&Cmd=Ping&Log=V121_Sst8_LdapC0_LdapL0_RpcC31_RpcL50_Hb3540_Erq1_Pk1728465481_S2_
443
redcloud\nikced
94.234.170.42
Apple-iPhone4C1/902.179
200
0
64
3140491
I'd appreciate any answer using C#, .NET and Regex to extract the above substrings into a collection (MatchCollection preferred). All log lines follow the same format and pattern.

Incredibly complex regex incoming:
logFile.Split(' ');

This will give you an array that you can iterate through to retrieve all of the "lines" which are separated by a space
string[] lines = log.Split(' ');

You don't need to use a Regex. You can simply use String.Split Method, and specify space as separator:
string [] substrings = line.Split(new Char [] {' '});
If you need to identify the kind of each part, then you should specify what you need to find, and a regex can be created for it.
Anyway, if you really want to use a Regex, do this:
Regex re = new Regex(@"(?:(?<s>[^ ]+)(?: |$))*");
This will give you all the captures in the "s" group, when you call the Match method.
As the OP pointed out in a comment, the separator can be something other than a single space, so the possible separators should be included in the (?: |$) and the [^ ] parts of the expression. For example, if both space and tab are possible separators, replace those parts with (?: |\t|$) and [^ \t]. If you need to accept a run of several separator characters, add a + after the () group:
(?:(?<s>[^ \t]+)(?: |\t|$)+)*

The fastest and most obvious way is to use String.Split:
string[] substrings = line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
But if you insist on a MatchCollection then this will do what you want:
MatchCollection substrings = Regex.Matches(line, @"\S+");
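A small self-contained sketch of the MatchCollection variant, using a shortened, hypothetical log line in the same space-separated layout:

```csharp
using System;
using System.Text.RegularExpressions;

// Shortened, hypothetical log line; the real lines carry more fields.
string line = "2012-05-15 00:49:02 192.168.100.10 POST /default.eas 443 200";

// \S+ matches each maximal run of non-whitespace characters.
MatchCollection substrings = Regex.Matches(line, @"\S+");

foreach (Match m in substrings)
{
    Console.WriteLine(m.Value);
}
```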

Really, you just need to break this down into the parts.
First, the date. Will it always be in YYYY-MM-DD format? Could it be possible that it will be different based on region/culture settings?
(?<LogDate>\d\d\d\d-\d\d-\d\d)
Next, you have the time. Same thing:
(?<LogTime>\d\d:\d\d:\d\d)
Next, I'm assuming this is the web method that was actually called? Not entirely sure, since you haven't really explained how the data is laid out. However, I'm assuming it's going to be either POST or GET, so that's what we're going to do next...
(?<LogMethod>POST|GET)
Just do this for every part of the log line you're interested in, and you'll be set, e.g.:
(?<LogDate>\d\d\d\d-\d\d-\d\d) (?<LogTime>\d\d:\d\d:\d\d) (?<LogMethod>POST|GET)...
If you want to anchor to the start/end of the line, be sure to use ^ and $ respectively. When you get the Matches, you can get the values from each group by indexing the Groups property with the named group (such as match.Groups["LogMethod"].Value). Good luck!
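Put together, the named-group approach might look like the sketch below. The log line is a shortened, hypothetical one, and the ServerIp group is an assumed name for the address that sits between the time and the method in the question's sample:

```csharp
using System;
using System.Text.RegularExpressions;

// Shortened, hypothetical log line following the question's layout.
string line = "2012-05-15 00:49:02 192.168.100.10 POST /default.eas 443";

// Named groups for the leading fields, anchored to the start of the line.
var logRegex = new Regex(
    @"^(?<LogDate>\d{4}-\d{2}-\d{2}) (?<LogTime>\d{2}:\d{2}:\d{2}) " +
    @"(?<ServerIp>\S+) (?<LogMethod>POST|GET) ");

Match match = logRegex.Match(line);
if (match.Success)
{
    // Each field is read back by its group name.
    Console.WriteLine(match.Groups["LogDate"].Value);
    Console.WriteLine(match.Groups["LogMethod"].Value);
}
```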

Regex Word splitting in C#

I know similar questions have been asked before, but I can't find one that is like mine, or enough like mine to help me out :). So essentially I want to split up a string which contains a bunch of words, and I don't want to return any characters that are not words (this is the key problem I am struggling with, ignoring characters). This is how I define the problem:
What constitutes a word is a string of any character a-zA-Z only
(no numbers or anything else)
In between any word, there can be any number of random other characters
I want to get back a string[] containing only the words
eg: text: "apple^&**^orange1247pear"
I want to return: apple, orange, pear in an array.
The closest I have found I suppose is this:
Regex.Split("apple^orange7pear", @"([a-zA-Z]*)")
Which splits out the apple/orange/pear, but also returns a bunch of other junk and blank strings.
Anyone know how to stop the split function from returning certain parts of the string, or is that not possible?
Thanks in advance for any help you give me :)

Split should match the tokens between your words. In your regex you've added a group around the word, so it is included in the result, but that isn't desired in this case. Note that this regex matches anything besides valid words - anything that isn't an ASCII letter:
string[] words = Regex.Split(str, "[^a-zA-Z]+");
Another option is to match the words directly:
MatchCollection matches = Regex.Matches(str, "[a-zA-Z]+");
string[] words2 = matches.Cast<Match>().Select(m => m.Value).ToArray();
The second option is probably clearer, and will not include blank elements on the start or end of the array.
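The difference between the two options shows up when the input starts or ends with non-letters. A small sketch, with junk added around the question's sample purely for illustration:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Leading and trailing junk added to the question's sample text.
string str = "^^apple^&**^orange1247pear!!";

// Split on non-letters: junk at either end yields empty strings.
string[] viaSplit = Regex.Split(str, "[^a-zA-Z]+");

// Match the words directly: only the words come back.
string[] viaMatches = Regex.Matches(str, "[a-zA-Z]+")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();

Console.WriteLine(viaSplit.Length);   // 5
Console.WriteLine(viaMatches.Length); // 3
```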

var splits = Regex.Split("aaa $$$bbb ccc", @"[^A-Za-z]+");
But to include non-latin letters, I would use this:
var splits = Regex.Split("aaa $$$bbb ccc", @"\P{L}+");

Try this:
Regex.Matches("kalle kula(/()&//()nisse8978971", @"[A-Za-z]+")
Using Matches() will collect only the words; Split() will divide the string, which is not what you want.

The second option Kobi listed is better and easier to control. I use the following regular expression to locate common entities such as words, numbers and email addresses in a string.
var regex = new Regex(@"[\p{L}\p{N}\p{M}]+(?:[-.'´_@][\p{L}\p{N}\p{M}]+)*", RegexOptions.Compiled);

C# regex need characters after \player_n\

I need a regex pattern which will accommodate for the following.
I get a response from a UDP server, it's a very long string and each word is separated by \, for example:
\g79g97\g879o\wot87gord\player_0\name0\g6868o\g78og89\g79g79\player_1\name1\gyuvui\yivyil\player_2\name2\g7g87\g67og9o\v78v9i7
I need the strings after \player_n\, so in the above example I would need name0, name1 and name2.
I know this is the second regex question of the day but I have the book (Mastering Regular Expressions) on order! Thank you.
UPDATE. elusive's regex pattern will suffice, and I can add the match(0) to a textbox. However, what if I want to add all the matches to the text box ?
textBox1.Text += match.Captures[0].ToString(); //this works fine.
How do I add "all" match.captures to the text box? :s sorry for being so lame, this Regex class is brand new to me .

Try this one:
\\player_\d+\\([^\\]+)
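To collect every match rather than only the first, the pattern can be run through Regex.Matches and group 1 read from each result. A sketch with a shortened sample response; the joined string stands in for whatever would be assigned to the text box:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Shortened version of the sample response from the question.
string response = @"\g79g97\player_0\name0\g6868o\player_1\name1\gyuvui\player_2\name2\g7g87";

// Group 1 captures the field that follows each \player_n\ marker.
MatchCollection matches = Regex.Matches(response, @"\\player_\d+\\([^\\]+)");

string[] names = matches.Cast<Match>().Select(m => m.Groups[1].Value).ToArray();

// All names joined into one string, e.g. for assignment to textBox1.Text.
string allNames = string.Join(Environment.NewLine, names);
Console.WriteLine(allNames);
```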

I think this test sample can help you:
string inp = @"\g79g97\g879o\wot87gord\player_0\name0\g6868o\g78og89\g79g79\player_1\name1\gyuvui\yivyil\player_2\name2\g7g87\g67og9o\v78v9i7";
string rex = @"[\w]*[\\]player_[0-9]+[\\](?<name>[A-Za-z0-9]*)\b";
Regex re = new Regex(rex);
// Walk through every match in the input, not just the first one.
for (Match m = re.Match(inp); m.Success; m = m.NextMatch())
{
    // Print the captured player name for this match.
    Console.WriteLine(m.Groups["name"]);
}
You can take the name of the player from m.Groups["name"].

To get only the player name, you could use:
(?<=\\player_\d+\\)[^\\]+
This (?<=\\player_\d+\\) is something called a positive look-behind. It makes sure that the actual match [^\\]+ is preceded by the expression in the parentheses.
In this case, it's even specific to only a few regex engines (.NET being among them, luckily), in that it contains a variable length expression (due to \d+). Most regex engines only support fixed-length look-behind.
In any case, look-behind is not necessarily the best approach to this problem; match groups are simpler and easier to read.
