Regex tokenize issue

I have strings input by the user and want to tokenize them. For that, I want to use regex and now have a problem with a special case.
An example string is
Test + "Hello" + "Good\"more" + "Escape\"This\"Test"
or the C# equivalent
@"Test + ""Hello"" + ""Good\""more"" + ""Escape\""This\""Test"""
I can match the Test and + tokens, but not the ones enclosed in ". I use " to let the user mark text as a literal string rather than a special token. If the user wants a " character inside the string, I thought of letting them escape it with a \.
So the rule would be: give me everything between two " characters, where the closing " must not be preceded by a \.
The results I expect are: "Hello" "Good\"more" "Escape\"This\"Test"
I need the " " characters to be in the final match so I know that this is a string.
I currently have the regex @"""([\w]*)(?<!\\"")""" which gives me the following results: "Hello" "more" "Test"
So the lookbehind isn't working the way I want. Does anyone know the correct way to get the strings I expect?

Here's an adaptation of a regex I use to parse command lines:
(?!\+)((?:"(?:\\"|[^"])*"?|\S)+)
Example here at regex101
(The adaptation is the negative lookahead to ignore + and checking for \" instead of "".)
Hope this helps you.
Regards.
Edit:
If you aren't interested in surrounding quotes:
(?!\+)(?:"((?:\\"|[^"])*)"?|(\S+))

To make it safer, I'd suggest getting all the substrings within unescaped pairs of "..." with the following regex:
^(?:[^"\\]*(?:\\.[^"\\]*)*("[^"\\]*(?:\\.[^"\\]*)*"))+
It matches
^ - start of string (so that we could check each " and escape sequence)
(?: - Non-capturing group 1 serving as a container for the subsequent subpatterns
[^"\\]*(?:\\.[^"\\]*)* - matches 0+ characters other than " and \ followed with 0+ sequences of \\. (any escape sequence) followed with 0+ characters other than " and \ (thus, we avoid matching the first " that is escaped, and it can be preceded with any number of escape sequences)
("[^"\\]*(?:\\.[^"\\]*)*") - Capture group 1 matching "..." substrings that may contain any escape sequences inside
)+ - end of the first non-capturing group that is repeated 1 or more times
See the regex demo and here is a C# demo:
var rx = "^(?:[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"))+";
var s = @"Test + ""Hello"" + ""Good\""more"" + \""Escape\""This\""Test\"" + ""f""";
var matches = Regex.Matches(s, rx)
    .Cast<Match>()
    .SelectMany(m => m.Groups[1].Captures.Cast<Capture>().Select(p => p.Value).ToArray())
    .ToList();
Console.WriteLine(string.Join("\n", matches));
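When backslashes are assumed to appear only inside quoted strings, the quoted-string sub-pattern can also serve as one branch of a small tokenizer that returns the other tokens as well. The following is a minimal sketch under that assumption; the names tokenPattern and tokens, and the \+ and \w+ branches, are illustrative and not part of the answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Input from the question: Test + "Hello" + "Good\"more" + "Escape\"This\"Test"
var input = @"Test + ""Hello"" + ""Good\""more"" + ""Escape\""This\""Test""";

// One alternative per token kind (assumed token set: strings, +, words).
// The first branch is the quoted string: "..." where \x is any escaped character.
var tokenPattern = @"""[^""\\]*(?:\\.[^""\\]*)*""|\+|\w+";

var tokens = Regex.Matches(input, tokenPattern)
    .Cast<Match>()
    .Select(m => m.Value)
    .ToList();

// Prints one token per line, with the quotes kept on string tokens.
Console.WriteLine(string.Join("\n", tokens));
```

For the question's input this yields seven tokens: Test, +, "Hello", +, "Good\"more", +, "Escape\"This\"Test". An escaped quote outside a string would still be misread by this sketch, which is the case the anchored regex guards against.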
UPDATE
If you need to remove the tokens, just match and capture all outside of them with this code:
var keep = "[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*";
var rx = string.Format("^(?:(?<keep>{0})\"{0}\")+(?<keep>{0})$", keep);
var s = @"Test + ""Hello"" + ""Good\""more"" + \""Escape\""This\""Test\"" + ""f""";
var matches = Regex.Matches(s, rx)
    .Cast<Match>()
    .SelectMany(m => m.Groups["keep"].Captures.Cast<Capture>().Select(p => p.Value).ToArray())
    .ToList();
Console.WriteLine(string.Join("", matches));
See another demo
Output: Test + + + \"Escape\"This\"Test\" + for @"Test + ""Hello"" + ""Good\""more"" + \""Escape\""This\""Test\"" + ""f""".

Related

C# Regular Expression Look Ahead S

I want to validate a string to see if it is at least 6 digits long and has at least 1 int.
string text = "I want to run 10 miles daily";
string pattern = @"(?=\.*\d).{6,}";
Match match = Regex.Match(text, pattern);
Console.WriteLine(match.Value);
Please explain why I am getting the output below:
"10 miles daily"

The reason you get "10 miles daily" is because you specify a positive lookahead (?=\.*\d) which matches a literal dot zero or more times and then a digit.
That assertion succeeds at the position before the 1 where it matches zero times a dot and then a digit:
I want to run 10 miles daily
..............|
From that position, .{6,} matches any character six or more times, which gives:
I want to run 10 miles daily
              ^^^^^^^^^^^^^^
You could update your regex to remove the backslash before the dot and use anchors to assert the start ^ and the end $ of the line:
^(?=.*\d).{6,}$
That would match
^ Assert the start of the line
(?=.*\d) Positive lookahead to assert that what follows contains a digit
.{6,} Match any character 6 or more times
$ Assert the end of the line
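To see the difference between the two patterns, here is a small sketch that runs both against the sentence from the question. The variable names unanchored and anchored are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

var text = "I want to run 10 miles daily";

// Original pattern: \.* is "zero or more literal dots", so the lookahead
// only succeeds directly before a digit (or before dots leading to one).
var unanchored = Regex.Match(text, @"(?=\.*\d).{6,}");

// Corrected pattern: .* inside the lookahead scans the rest of the string,
// and ^...$ forces the match to cover the entire input.
var anchored = Regex.Match(text, @"^(?=.*\d).{6,}$");

Console.WriteLine(unanchored.Value); // 10 miles daily
Console.WriteLine(anchored.Value);   // I want to run 10 miles daily

// No digit anywhere, so the lookahead fails and nothing matches.
Console.WriteLine(Regex.IsMatch("abcdef", @"^(?=.*\d).{6,}$")); // False
```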

I know this doesn't answer your question, but I assume you're looking for a working RegEx. Your statement was "I want to validate a string to see if it is at least 6 digits long and has at least 1 int." I assume you mean at least 6 characters long (including white space) and has at least 1 int.
This should do it (C#):
@"^(?=.*\d+)(?=.*[\w]).{6,}$";
RegEx Analyzer: UltraPico Expresso RegEx Tool
Test code and output (C#)
static void Main(string[] args)
{
    string text = "abcdef"; // no digit: Matched:False
    Match match;
    string pattern = @"^(?=.*\d+)(?=.*[\w]).{6,}$";
    match = Regex.Match(text, pattern, RegexOptions.IgnoreCase);
    Console.WriteLine("Text:'" + text + "'. Matched:" + match.Success + ". Value:" + match.Value);
    text = "abcdefg"; // no digit: Matched:False
    match = Regex.Match(text, pattern, RegexOptions.IgnoreCase);
    Console.WriteLine("Text:'" + text + "'. Matched:" + match.Success + ". Value:" + match.Value);
    text = "abcde1"; // 6 characters, one digit: Matched:True
    match = Regex.Match(text, pattern, RegexOptions.IgnoreCase);
    Console.WriteLine("Text:'" + text + "'. Matched:" + match.Success + ". Value:" + match.Value);
    text = "abcd21"; // Matched:True
    match = Regex.Match(text, pattern, RegexOptions.IgnoreCase);
    Console.WriteLine("Text:'" + text + "'. Matched:" + match.Success + ". Value:" + match.Value);
    text = "abcd dog cat 21"; // white space counts as characters: Matched:True
    match = Regex.Match(text, pattern, RegexOptions.IgnoreCase);
    Console.WriteLine("Text:'" + text + "'. Matched:" + match.Success + ". Value:" + match.Value);
    Console.ReadKey();
}

\.*\d matches zero or more literal dots followed by a digit.
(?=\.*\d) is a lookahead that succeeds only at a position where that pattern matches; in your string, that is directly before the first digit.
(?=\.*\d). matches one character at that position, i.e. the digit itself.
(?=\.*\d).{6,} matches at least 6 characters starting from that digit.
You need this regex: (?=.*\d).{6,}$. It matches a string of at least 6 characters that contains at least one digit.

\.*\d means a digit preceded by literal dots (`.`),
where * repeats the preceding character, the dot, zero or more times.
Your input string contains no dot, so the regex engine can only match by taking zero repetitions.
As a result, your regex is effectively interpreted as follows.
(?=\d).{6,}
This regex means: starting at a digit, match 6 or more characters (the digit plus at least 5 more).
{6,} is greedy: it matches the longest possible string. The lazy form ({6,}? in this case) matches the shortest possible string.
You can try this lazy regex and compare its result with the greedy one.
(?=\d).{6,}?
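A quick sketch of that comparison, run against the sentence from the question (the variable names greedy and lazy are illustrative):

```csharp
using System;
using System.Text.RegularExpressions;

var text = "I want to run 10 miles daily";

// Greedy: takes as many characters as possible after reaching the digit.
var greedy = Regex.Match(text, @"(?=\d).{6,}").Value;

// Lazy: stops as soon as the minimum of 6 characters is satisfied.
var lazy = Regex.Match(text, @"(?=\d).{6,}?").Value;

Console.WriteLine(greedy); // 10 miles daily
Console.WriteLine(lazy);   // 10 mil
```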

using Regex to iterate over a string and search for 3 consecutive hyphens and replace it with [space][hyphen][space]

I currently have a string which looks like this when it is returned :
//This is the url string
// the-great-debate---toilet-paper-over-or-under-the-roll
string name = string.Format("{0}",url);
name = Regex.Replace(name, "-", " ");
After that Regex.Replace call, it becomes:
the great debate toilet paper over or under the roll
However, as mentioned in the title, I want to apply a regex to the url string so that I get the following output:
the great debate - toilet paper over or under the roll
I would really appreciate any assistance.
[EDIT] Not all the strings look like this. Some of them contain only single hyphens, so the above method works for them:
world-water-day-2016
and it changes to
world water day 2016
but for this one:
the-great-debate---toilet-paper-over-or-under-the-roll
I need a way to check whether the string has 3 consecutive hyphens and, if so, replace those 3 hyphens with [space][hyphen][space], and then replace all the remaining single hyphens between the words with a space.

First of all, there is always a naive solution to this kind of problem: replace the special case (here, ---) with a temporary placeholder made of characters that do not normally occur in your data, then replace the generic case (single hyphens), and finally replace the placeholder with the text you actually want.
var name = url.Replace("---", "[ \uFFFD ]").Replace("-", " ").Replace("[ \uFFFD ]", " - ");
You may also use a regex based replacement that matches either a 3-hyphen substring capturing it, or just match a single hyphen, and then check if Group 1 matched inside a match evaluator (the third parameter to Regex.Replace can be a Match evaluator method).
It will look like
var name = Regex.Replace(url, @"(---)|-", m => m.Groups[1].Success ? " - " : " ");
See the C# demo.
So, when (---) part matches, the 3 hyphens are put into Group 1 and the .Success property is set to true. Thus, m => m.Groups[1].Success ? " - " : " " replaces 3 hyphens with space+-+space and 1 hyphen (that may be actually 1 of the 2 consecutive hyphens) with a space.
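As a quick check of the match-evaluator approach against both URL shapes from the question, here is a small sketch. SlugToTitle is a hypothetical helper name that only wraps the Regex.Replace call.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: "---" becomes " - ", any other hyphen becomes a space.
string SlugToTitle(string slug) =>
    Regex.Replace(slug, @"(---)|-", m => m.Groups[1].Success ? " - " : " ");

Console.WriteLine(SlugToTitle("the-great-debate---toilet-paper-over-or-under-the-roll"));
// the great debate - toilet paper over or under the roll
Console.WriteLine(SlugToTitle("world-water-day-2016"));
// world water day 2016
```

Because (---) is listed first in the alternation, the three-hyphen case is tried before the single-hyphen case at each position.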

Here's a solution using LINQ rather than Regex:
var str = "the-great-debate---toilet-paper-over-or-under-the-roll";
var result = str.Split(new string[] {"---"}, StringSplitOptions.None)
.Select(s => s.Replace("-", " "))
.Aggregate((c,n) => $"{c} - {n}");
// result = "the great debate - toilet paper over or under the roll"
Split the string up based on the ---, then remove hyphens from each substring, then join them back together.

The easy way:
name = Regex.Replace(name, @"\b-|-\b", " ");
The show-off way:
name = Regex.Replace(name, @"(\b)?-(?(1)|\b)", " ");
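Both patterns depend on \b meaning a regex word boundary. In C# that requires a verbatim string (@"..."), because in a regular string literal "\b" is the backspace character. A small sketch of the difference, with illustrative variable names:

```csharp
using System;
using System.Text.RegularExpressions;

var name = "the-great-debate---toilet-paper-over-or-under-the-roll";

// Regular literal: "\b" is a backspace character, so the pattern looks for
// a backspace next to a hyphen and finds nothing in this input.
var withBackspace = Regex.Replace(name, "\b-|-\b", " ");

// Verbatim literal: \b reaches the regex engine as a word boundary, so only
// hyphens that touch a word character are replaced.
var withBoundary = Regex.Replace(name, @"\b-|-\b", " ");

Console.WriteLine(withBackspace == name); // True (nothing was replaced)
Console.WriteLine(withBoundary);
// the great debate - toilet paper over or under the roll
```

The middle hyphen of --- has no word character on either side, so it is the only one left in place.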

Detecting a word followed by a dot or whitespace using regex

I am using regex and C# to find occurrences of a particular word using
Regex regex = new Regex(@"\b" + word + @"\b");
How can I modify my Regex to only detect the word if it is either preceded with a whitespace, followed with a whitespace or followed with a dot?
Examples:
this.Button.Value - should match
this.value - should match
document.thisButton.Value - should not match

You may use lookarounds and alternation to check for the 2 possibilities when a keyword is enclosed with spaces or is just followed with a dot:
var line = "this.Button.Value\nthis.value\ndocument.thisButton.Value";
var word = "this";
var rx = new Regex(string.Format(@"(?<=\s)\b{0}\b(?=\s)|\b{0}\b(?=\.)", word));
var result = rx.Replace(line, "NEW_WORD");
Console.WriteLine(result);
See IDEONE demo and a regex demo.
The pattern matches:
(?<=\s)\bthis\b(?=\s) - whole word "this" that is preceded with whitespace (?<=\s) and that is followed with whitespace (?=\s)
| - or
\bthis\b(?=\.) - whole word "this" that is followed with a literal . ((?=\.))
Since lookarounds are not consuming characters (the regex index remains where it was) the characters matched with them are not placed in the match value, and are thus untouched during the replacement.

If I am understanding you correctly:
Regex regex = new Regex(@"\b" + word + @"[ .]");

Regex regex = new Regex(@"((?<=( \.))" + word + @"\b)" + "|" + @"(\b" + word + @"[ .])");
However, note that this could cause trouble if word contains characters that have special meanings in Regular Expressions. I'm assuming that word contains alpha-numeric characters only.

The (?<=...) lookbehind checks what precedes and the (?=...) lookahead checks what follows, both without including them in the match.
Regex regex = new Regex(@"(?<=\s)\b" + word + @"\b|\b" + word + @"\b(?=[\s\.])");
EDIT: Pattern updated.
EDIT 2: Online test: http://ideone.com/RXRQM5

Regex works on .NET test site but not in C# environment

Referring to this stackoverflow question:
- Regex Pattern help: finding HTML pattern when nested ASP.NET Eval?
I received an answer to the problem here:
- regexstorm link
The .NET answer that works on the regex .NET testing site does NOT work in my C# Visual Studio environment. Here is the Unit Test for it:
[Test]
public void GetAllHtmlSubsectionsWorksAsExpected()
{
    var regPattern = new Regex(@"(?'o'<)(.*)(?'-o'>)+");
    var html =
        "<%# Page Language=\"C#\" %>" +
        "<td class=\"c1 c2 c3\" colspan=\"2\">" +
        "lorem ipsum" +
        "<div class=\"d1\" id=\"div2\" attrid=\"<%# Eval(\"CategoryID\") %>\">" +
        "testing 123" +
        "</div>" +
        "asdf" +
        "</td>";
    List<string> results = new List<string>();
    MatchCollection matches = regPattern.Matches(html);
    for (int mnum = 0; mnum < matches.Count; mnum++)
    {
        Match match = matches[mnum];
        results.Add("Match #" + (mnum + 1) + " - Value: " + match.Value);
    }
    Assert.AreEqual(5, results.Count()); //Fails: results.Count() == 1
}
Why does this work on the regexstorm website but not in my unit test?

NOTE that parsing HTML with regex is not a best practice, you should use a dedicated parser.
Now, as for the question itself, the pattern you use will work only with lines having 1 single substring starting with < and ending with corresponding >. However, your input string has no newline characters! It looks like:
<%# Page Language="C#" %><td class="c1 c2 c3" colspan="2">lorem ipsum<div class="d1" id="div2" attrid="<%# Eval("CategoryID") %>">testing 123</div>asdf</td>
The .* subpattern is a greedy dot pattern: it matches as many characters other than a newline as possible. It grabs the whole line and then backtracks until the next subpattern (here, >) can match, so the match ends at the last possible >.
To fix that, you need a proper balanced construct matching pattern:
<((?>[^<>]+|<(?<c>)|>(?<-c>))*(?(c)(?!)))>
See regex demo
C#:
var r = new Regex(@"
    <                  # First '<'
    (                  # Capturing group 1
      (?>              # Atomic group start
        [^<>]+         # Match all characters other than `<` or `>`
        |
        < (?<c>)       # Match '<', and add a capture into group 'c'
        |
        > (?<-c>)      # Match '>', and delete 1 value from capture stack
      )*
      (?(c)(?!))       # Fails if 'c' stack isn't empty!
    )
    >                  # Last closing `>`
", RegexOptions.IgnorePatternWhitespace);
DISCLAIMER: Even this regex will fail if you have unpaired < or > in your element nodes, which is why you should not use regex to parse HTML.
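To compare the two patterns on the exact input from the unit test, here is a minimal sketch. The variable names greedyCount, balanced and tags are illustrative, and the code assumes the one-line html string from the question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Same input as the unit test in the question (one line, no newlines).
var html =
    "<%# Page Language=\"C#\" %>" +
    "<td class=\"c1 c2 c3\" colspan=\"2\">" +
    "lorem ipsum" +
    "<div class=\"d1\" id=\"div2\" attrid=\"<%# Eval(\"CategoryID\") %>\">" +
    "testing 123" +
    "</div>" +
    "asdf" +
    "</td>";

// Greedy pattern from the question: .* runs to the last '>' on the line.
var greedyCount = Regex.Matches(html, @"(?'o'<)(.*)(?'-o'>)+").Count;

// Balanced pattern: every inner '<' pushes onto stack 'c', every inner '>' pops it.
var balanced = new Regex(@"<((?>[^<>]+|<(?<c>)|>(?<-c>))*(?(c)(?!)))>");
var tags = balanced.Matches(html).Cast<Match>().Select(m => m.Value).ToList();

Console.WriteLine(greedyCount); // 1
Console.WriteLine(tags.Count);  // 5
Console.WriteLine(string.Join("\n", tags));
```

The third match is the whole div tag, including the nested <%# Eval("CategoryID") %> expression, because the inner < and > are balanced before the closing > is accepted.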

There are two different things in regex: Matching and capturing.
What you want here is the capturing group 1.
So you need to use this:
results.Add("Match #" + (mnum + 1) + " - Value: " + match.Groups[1].Value);
Also, as the other answer pointed out, your input has no newlines, so the regex captures it all in the first match.

C# RegEx to find a specific string or all words in a string

Looking it up, I thought I understood how to look up a string of multiple words in a sentence, but it does not find a match. Can someone tell me what I am doing wrong? I need to be able to find a single or multiple word match. I passed in "to find" to the method and it did not find the match. Also, if the user does not enclose their search phrase in quotes, I also need it to search on each word entered.
var pattern = @"\b\" + searchString + @"\b"; //searchString is passed in.
Regex rgx = new Regex(pattern);
var sentence = "I need to find a string in this sentence!";
Match match = rgx.Match(sentence);
if (match.Success)
{
// Do something with the match.
}

Just remove the second \ in the first @"\b\":
var pattern = @"\b" + searchString + @"\b";
                  ^
See IDEONE demo
Note that in case you have special regex metacharacters (like (, ), [, +, *, etc.) in your searchStrings, you can use Regex.Escape() to escape them:
var pattern = @"\b" + Regex.Escape(searchString) + @"\b";
And if those characters may appear in edge positions, use lookarounds rather than word boundaries:
var pattern = @"(?<!\w)" + Regex.Escape(searchString) + @"(?!\w)";
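The edge-position case is easiest to see with a search string that ends in a non-word character. A small sketch, assuming an example sentence and the search string "C++" (both made up for illustration):

```csharp
using System;
using System.Text.RegularExpressions;

var sentence = "I write C++ and C# code";
var searchString = "C++";

// Regex.Escape turns the metacharacters into literals: C\+\+
var escaped = Regex.Escape(searchString);

// \b needs a word character on exactly one side. Between '+' and ' ' there
// is none, so the trailing \b can never match here.
var withBoundaries = new Regex(@"\b" + escaped + @"\b");

// Lookarounds only require that no word character touches the match.
var withLookarounds = new Regex(@"(?<!\w)" + escaped + @"(?!\w)");

Console.WriteLine(withBoundaries.IsMatch(sentence));      // False
Console.WriteLine(withLookarounds.Match(sentence).Value); // C++
```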
