Run multiple RegEx patterns on single string

Run multiple RegEx patterns on single string - c#

I need to run a C# RegEx match on a string.
Problem is, I'm looking for more than one pattern on a single string, and I cannot find a way to do that with a single run.
For example, in the string
The dog has jumped
I'm looking for "dog" and for "dog has".
I don't know how can I get those two results with one pass.
I've tried to concatenate the pattern with the alteration symbol (|), like that:
(dog|dog has)
But it returned only the first match.
What can I use to get back both the matches?
Thanks!

The regex engine will return the first substring that satisfied the pattern. If you write (dog|dog has), it won't ever be able to match dog has because dog has starts with dog, which is the first alternative. Furthermore, the regex engine won't return overlapping matches.
Here's a convoluted method:
var patterns = new[] { "dog", "dog has" };
var sb = new StringBuilder();
for (var i = 0; i < patterns.Length; i++)
sb.Append(#"(?=(?<p").Append(i).Append(">").Append(patterns[i]).Append("))?");
var regex = new Regex(sb.ToString(), RegexOptions.Compiled);
Console.WriteLine("Pattern: {0}", regex);
var input = "a dog has been seen with another dog";
Console.WriteLine("Input: {0}", input);
foreach (var match in regex.Matches(input).Cast<Match>())
{
for (var i = 0; i < patterns.Length; i++)
{
var group = match.Groups["p" + i];
if (!group.Success)
continue;
Console.WriteLine("Matched pattern #{0}: '{1}' at index {2}", i, group.Value, group.Index);
}
}
This produces the following output:
Pattern: (?=(?<p0>dog))?(?=(?<p1>dog has))?
Input: a dog has been seen with another dog
Matched pattern #0: 'dog' at index 2
Matched pattern #1: 'dog has' at index 2
Matched pattern #0: 'dog' at index 33
Yes, this is an abuse of the regex engine :)
This works by building a pattern using optional lookaheads, which capture the substrings as a side effect, but the pattern otherwise always matches an empty string. So there are n+1 total matches, n being the input length. The patterns cannot contain numbered backreferences, but you can use named backreferences instead.
Also, this can return overlapping matches, as it will try to match all patterns at all string positions.
But you definitely should benchmark this against a manual approach (looping over the patterns and matching each of them separately). I don't expect this to be fast...

You can use one regex pattern to do both.
Pattern: (dog\b has\b)|(dog\b)
I figured out this pattern using the online builder here: enter link description here
Then you can use it in C# with the regex class by doing something like
Regex reg = new Regex("(dog\b has\b)|(dog\b)", RegexOptions.IgnoreCase);
if (reg.IsMatch){
//found dog or dog has
}

Related

How to find repeatable characters

I can't understand how to solve the following problem:
I have input string "aaaabaa" and I'm trying to search for string "aa" (I'm looking for positions of characters)
Expected result is
0 1 2 5
aa aabaa
a aa abaa
aa aa baa
aaaab aa
This problem is already solved by me using another approach (non-RegEx).
But I need a RegEx I'm new to RegEx so google-search can't help me really.
Any help appreciated! Thanks!
P.S.
I've tried to use (aa)* and "\b(\w+(aa))*\w+" but those expressions are wrong

You can solve this by using a lookahead
a(?=a)
will find every "a" that is followed by another "a".
If you want to do this more generally
(\p{L})(?=\1)
This will find every character that is followed by the same character. Every found letter is stored in a capturing group (because of the brackets around), this capturing group is then reused by the positive lookahead assertion (the (?=...)) by using \1 (in \1 there is the matches character stored)
\p{L} is a unicode code point with the category "letter"
Code
String text = "aaaabaa";
Regex reg = new Regex(#"(\p{L})(?=\1)");
MatchCollection result = reg.Matches(text);
foreach (Match item in result) {
Console.WriteLine(item.Index);
}
Output
0
1
2
5

The following code should work with any regular expression without having to change the actual expression:
Regex rx = new Regex("(a)\1"); // or any other word you're looking for.
int position = 0;
string text = "aaaaabbbbccccaaa";
int textLength = text.Length;
Match m = rx.Match(text, position);
while (m != null && m.Success)
{
Console.WriteLine(m.Index);
if (m.Index <= textLength)
{
m = rx.Match(text, m.Index + 1);
}
else
{
m = null;
}
}
Console.ReadKey();
It uses the option to change the start index of a regex search for each consecutive search. The actual problem comes from the fact that the Regex engine, by default, will always continue searching after the previous match. So it will never find a possible match within another match, unless you instruct it to by using a Look ahead construction or by manually setting the start index.
Another, relatively easy, solution is to just stick the whole expression in a forward look ahead:
string expression = "(a)\1"
Regex rx2 = new Regex("(?=" + expression + ")");
MatchCollection ms = rx2.Matches(text);
var indexes = ms.Cast<Match>().Select(match => match.Index);
That way the engine will automatically advance the index by one for every match it finds.
From the docs:
When a match attempt is repeated by calling the NextMatch method, the regular expression engine gives empty matches special treatment. Usually, NextMatch begins the search for the next match exactly where the previous match left off. However, after an empty match, the NextMatch method advances by one character before trying the next match. This behavior guarantees that the regular expression engine will progress through the string. Otherwise, because an empty match does not result in any forward movement, the next match would start in exactly the same place as the previous match, and it would match the same empty string repeatedly.

Try this:
How can I find repeated characters with a regex in Java?
It is in java, but the regex and non-regex way is there. C# Regex is very similar to the Java way.

Why isn't C# following my regex?

I have a C# application that reads a word file and looks for words wrapped in < brackets >
It's currently using the following code and the regex shown.
private readonly Regex _regex = new Regex("([<])([^>]*)([>])", RegexOptions.Compiled);
I've used several online testing tools / friends to validate that the regex works, and my application proves this (For those playing at home, http://wordfiller.codeplex.com)!
My problem is however the regex will also pickup extra rubbish.
E.G
I'm walking on <sunshine>.
will return
sunshine>.
it should just return
<sunshine>
Anyone know why my application refuses to play by the rules?

I don't think the problem is your regex at all. It could be improved somewhat -- you don't need the ([]) around each bracket -- but that shouldn't affect the results. My strong suspicion is that the problem is in your C# implementation, not your regex.
Your regex should split <sunshine> into three separate groups: <, sunshine, and >. Having tested it with the code below, that's exactly what it does. My suspicion is that, somewhere in the C# code, you're appending Group 3 to Group 2 without realizing it. Some quick C# experimentation supports this:
private readonly Regex _regex = new Regex("([<])([^>]*)([>])", RegexOptions.Compiled);
private string sunshine()
{
string input = "I'm walking on <sunshine>.";
var match = _regex.Match(input);
var regex2 = new Regex("<[^>]*>", RegexOptions.Compiled); //A slightly simpler version
string result = "";
for (int i = 0; i < match.Groups.Count; i++)
{
result += string.Format("Group {0}: {1}\n", i, match.Groups[i].Value);
}
result += "\nWhat you're getting: " + match.Groups[2].Value + match.Groups[3].Value;
result += "\nWhat you want: " + match.Groups[0].Value + " or " + match.Value;
result += "\nBut you don't need all those brackets and groups: " + regex2.Match(input).Value;
return result;
}
Result:
Group 0: <sunshine>
Group 1: <
Group 2: sunshine
Group 3: >
What you're getting: sunshine>
What you want: <sunshine> or <sunshine>
But you don't need all those brackets and groups: <sunshine>

We will need to see more code to solve the problem. There is an off by one error somewhere in your code. It is impossible for that regular expression to return sunshine>.. Therefore the regular expression in question is not the problem. I would assume, without more details, that something is getting the index into the string containing your match and it is one character too far into the string.

If all you want is the text between < and > then you'd be better off using:
[<]([^>]*)[>] or simpler: <([^>]+)>
If you want to include < and > then you could use:
([<][^>]*[>]) or simpler: (<[^>]+>)
You're expression currently has 3 Group Matches - indicated by the brackets ().
In the case of < sunshine> this will currently return the following:
Group 1 : "<"
Group 2 : "sunshine"
Group 3 : ">"
So if you only looked at the 2nd group it should work!
The only explanation I can give for your observed behaviour is that where you pull the matches out, you are adding together Groups 2 + 3 and not Group 1.

What you posted works perfectly fine.
Regex _regex = new Regex("([<])([^>]*)([>])", RegexOptions.Compiled);
string test = "I'm walking on <sunshine>.";
var match = _regex.Match(test);
Match is <sunshine> i guess you need to provide more code.

Regex is eager by default. Teach it to be lazy!
What I mean is, the * operator considers as many repetitions as possible (it's said to be eager). Use the *? operator instead, this tells Regex to consider as few repetitions as possible (i.e. to be lazy):
<.*?>

Because you are using parenthesis, you are creating matching groups. This is causing the match collection to match the groups created by the regular expression to also be matched. You can reduce your regular expression to [<][^>]*[>] and it will match only on the <text> that you wish.

c# regex to extract link after =

Couldn't find better title but i need a Regex to extract link from sample below.
snip... flashvars.image_url = 'http://domain.com/test.jpg' ..snip
assuming regex is the best way.
thanks

Consider the following sample code. It shows how one might extract from your supplied string. But I have expanded upon the string some. Generally, the use of .* is too all inclusive (as the example below demonstrates).
The main point, is there are several ways to do what you are asking, the first answer given uses "look-around" while the second suggests the "Groups" approach. The choice mainly depend upon your actual data.
string[] tests = {
#"snip... flashvars.image_url = 'http://domain.com/test.jpg' ..snip",
#"snip... flashvars.image_url = 'http://domain.com/test.jpg' flashvars2.image_url = 'http://someother.domain.com/test.jpg'",
};
string[] patterns = {
#"(?<==\s')[^']*(?=')",
#"=\s*'(.*)'",
#"=\s*'([^']*)'",
};
foreach (string pattern in patterns)
{
Console.WriteLine();
foreach (string test in tests)
foreach (Match m in Regex.Matches(test, pattern))
{
if (m.Groups.Count > 1)
Console.WriteLine("{0}", m.Groups[1].Value);
else
Console.WriteLine("{0}", m.Value);
}
}

A simple regex for this would be #"=\s*'(.*)'".

Edit: New regex matching your edited question:
You need to match what's between quotes, after a =, right?
#"(?<==\s*')[^']*(?=')"
should do.
(?<==\s*') asserts that there is a =, optionally followed by whitespace, followed by a ', just before our current position (positive lookbehind).
[^']* matches any number of non-' characters.
(?=') asserts that the match stops before the next '.
This regex doesn't check if there is indeed a URL inside those quotes. If you want to do that, use
#"(?<==\s*')(?=(?:https?|ftp|mailto)\b)[^']*(?=')"

C# Parsing Text Within Quotes

I'm developing a simple little search mechanism and I want to allow the user to search for chunks of text with spaces. For example, a user can search for the name of a person:
Name: John Smith
I then "John Smith".Split(' ') into an array of two elements, {"John","Smith"}. I then return all of the records that match "John" AND "Smith" first followed by records that match either "John" OR "Smith." I then return no records for no matches. This isn't a complicated scenario and I have this part working.
I'd now like to be able to allow the user to ONLY return records that match "John Smith"
I'd like to use a basic quote syntax for searching. So if a user wants to search for "John Smith" OR Pocahontas they would enter: "John Smith" Pocahontas. The order of terms is absolutely irrelevant; "John Smith" does not receive priority over Pocahontas because he comes first in the list.
I have two main trains of thought on how I should parse the input.
A) Using regular expression then parsing stuff (IndexOf, Split)
B) Using only the parsing methods
I think a logical point of action would be to find the stuff in quotes; then remove it from the original string and insert it into a separate list. Then all the stuff left over from the original string could be split on the space and inserted into that separate list. If there is either 1 quote or an odd number, it is simply removed from the list.
How do I find matches the from within regex? I know about regex.Replace, but how would I iterate through the matches and insert them into a list. I know there is some neat way to do this using the MatchEvaluator delegate and linq, but I know basically nothing about regex in c#.

EDIT: Came back to this tab withou refreshing and didn't realize this question was already answered... accepted answer is better.
I think pulling out the stuff in quotes first with regex is a good idea. Maybe something like this:
String sampleInput = "\"John Smith\" Pocahontas Bambi \"Jane Doe\" Aladin";
//Create regex pattern
Regex regex = new Regex("\"([^\".]+)\"");
List<string> searches = new List<string>();
//Loop through all matches from regex
foreach (Match match in regex.Matches(sampleInput))
{
//add the match value for the 2nd group to the list
//(1st group is the entire match)
//(2nd group is the first parenthesis group in the defined regex pattern
// which in this case is the text inside the quotes)
searches.Add(match.Groups[1].Value);
}
//remove the matches from the input
sampleInput = regex.Replace(sampleInput, String.Empty);
//split the remaining input and add the result to our searches list
searches.AddRange(sampleInput.Split(new char[] {' '}, StringSplitOptions.RemoveEmptyEntries));

I needed the same functionality as Shawn but I didn't want to use regex. Here is a simple solution that I came up with uses Split() instead of regex for anyone else needing this functionality.
This works because the Split method, by default, will create empty entries in the array for consecutive search values in the source string. If we split on the quote character then the result is an array where the even indexed entries are individual words and the odd indexed entries will be the quotes phrases.
Example:
“John Smith” Pocahontas
Results in
item(0) = (empty string)
item(1) = John Smith
item(2) = Pocahontas
And
1 2 “3 4” 5 “6 7” “8 9”
Results in
item(0) = 1 2
item(1) = 3 4
item(2) = 5
item(3) = 6 7
item(4) = (empty string)
item(5) = 8 9
Note that an unmatched quote will result in a phrase from the last quote to the end of the input string.
public static List<string> QueryToTerms(string query)
{
List<string> Result = new List<string>();
// split on the quote token
string[] QuoteTerms = query.Split('"');
// switch to denote if the current loop is processing words or a phrase
bool WordTerms = true;
foreach (string Item in QuoteTerms)
{
if (!string.IsNullOrWhiteSpace(Item))
if (WordTerms)
{
// Item contains words. parse them and ignore empty entries.
string[] WTerms = Item.Split(new string[] { " " }, StringSplitOptions.RemoveEmptyEntries);
foreach (string WTerm in WTerms)
Result.Add(WTerm);
}
else
// Item is a phrase.
Result.Add(Item);
// Alternate between words and phrases.
WordTerms = !WordTerms;
}
return Result;
}

Use a regex like this:
string input = "\"John Smith\" Pocahontas";
Regex rx = new Regex(#"(?<="")[^""]+(?="")|[^\s""]\S*");
for (Match match = rx.Match(input); match.Success; match = match.NextMatch()) {
// use match.Value here, it contains the string to be searched
}

Google-like search query tokenization & string splitting

I'm looking to tokenize a search query similar to how Google does it. For instance, if I have the following search query:
the quick "brown fox" jumps over the "lazy dog"
I would like to have a string array with the following tokens:
the
quick
brown fox
jumps
over
the
lazy dog
As you can see, the tokens preserve the spaces with in double quotes.
I'm looking for some examples of how I could do this in C#, preferably not using regular expressions, however if that makes the most sense and would be the most performant, then so be it.
Also I would like to know how I could extend this to handle other special characters, for example, putting a - in front of a term to force exclusion from a search query and so on.

So far, this looks like a good candidate for RegEx's. If it gets significantly more complicated, then a more complex tokenizing scheme may be necessary, but your should avoid that route unless necessary as it is significantly more work. (on the other hand, for complex schemas, regex quickly turns into a dog and should likewise be avoided).
This regex should solve your problem:
("[^"]+"|\w+)\s*
Here is a C# example of its usage:
string data = "the quick \"brown fox\" jumps over the \"lazy dog\"";
string pattern = #"(""[^""]+""|\w+)\s*";
MatchCollection mc = Regex.Matches(data, pattern);
foreach(Match m in mc)
{
string group = m.Groups[0].Value;
}
The real benefit of this method is it can be easily extened to include your "-" requirement like so:
string data = "the quick \"brown fox\" jumps over " +
"the \"lazy dog\" -\"lazy cat\" -energetic";
string pattern = #"(-""[^""]+""|""[^""]+""|-\w+|\w+)\s*";
MatchCollection mc = Regex.Matches(data, pattern);
foreach(Match m in mc)
{
string group = m.Groups[0].Value;
}
Now I hate reading Regex's as much as the next guy, but if you split it up, this one is quite easy to read:
(
-"[^"]+"
|
"[^"]+"
|
-\w+
|
\w+
)\s*
Explanation
If possible match a minus sign, followed by a " followed by everything until the next "
Otherwise match a " followed by everything until the next "
Otherwise match a - followed by any word characters
Otherwise match as many word characters as you can
Put the result in a group
Swallow up any following space characters

I was just trying to figure out how to do this a few days ago. I ended up using Microsoft.VisualBasic.FileIO.TextFieldParser which did exactly what I wanted (just set HasFieldsEnclosedInQuotes to true). Sure it looks somewhat odd to have "Microsoft.VisualBasic" in a C# program, but it works, and as far as I can tell it is part of the .NET framework.
To get my string into a stream for the TextFieldParser, I used "new MemoryStream(new ASCIIEncoding().GetBytes(stringvar))". Not sure if this is the best way to do it.
Edit: I don't think this would handle your "-" requirement, so maybe the RegEx solution is better

Go char by char to the string like this: (sort of pseudo code)
array words = {} // empty array
string word = "" // empty word
bool in_quotes = false
for char c in search string:
if in_quotes:
if c is '"':
append word to words
word = "" // empty word
in_quotes = false
else:
append c to word
else if c is '"':
in_quotes = true
else if c is ' ': // space
if not empty word:
append word to words
word = "" // empty word
else:
append c to word
// Rest
if not empty word:
append word to words

I was looking for a Java solution to this problem and came up with a solution using #Michael La Voie's. Thought I would share it here despite the question being asked for in C#. Hope that's okay.
public static final List<String> convertQueryToWords(String q) {
List<String> words = new ArrayList<>();
Pattern pattern = Pattern.compile("(\"[^\"]+\"|\\w+)\\s*");
Matcher matcher = pattern.matcher(q);
while (matcher.find()) {
MatchResult result = matcher.toMatchResult();
if (result != null && result.group() != null) {
if (result.group().contains("\"")) {
words.add(result.group().trim().replaceAll("\"", "").trim());
} else {
words.add(result.group().trim());
}
}
}
return words;
}

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Run multiple RegEx patterns on single string - c#

Related

How to find repeatable characters

Why isn't C# following my regex?

c# regex to extract link after =

C# Parsing Text Within Quotes

Google-like search query tokenization & string splitting

Categories

Resources