C# Regex Replace Either Or With Overlapping Match Phrases - c#

I have some text like so:
string[] words = new string[] { "Billy", "Billy Jr.", "party" };
string s = "<p>Billy and Billy Jr. are both coming to the party.</p>";
I want to do a C# regex to highlight the words in the array:
string s = "<p><span>Billy</span> and <span>Billy Jr.</span> are both coming to the <span>party</span>.</p>";
I tried using a foreach loop:
foreach (string word in words)
{
s = Regex.Replace(s, word, "<span>$&</span>", RegexOptions.IgnoreCase);
}
But the problem is, when I do Billy, it will match on Billy Jr. and that phrase will get wrapped twice. How do I accomplish what I want?

Instead of looping and running a regex three times, you can build one single regex that tries the longer phrases first:
(Billy Jr\.|Billy|party)
If you use this, it will match Billy Jr. before Billy, so if the first is found, it will only replace that one.
And the regex101 proof.
To shamelessly steal juharr's comment (which does what I wrote above), you can do this in C#:
s = Regex.Replace
( s
, string.Join
( "|"
, words.OrderByDescending(w => w.Length)
.Select(Regex.Escape)
)
, "<span>$&</span>"
, RegexOptions.IgnoreCase
);
What it does: it builds one regular expression from the words array. It first sorts the words longest first, to prevent the 'Billy' problem. Then it calls Regex.Escape on every word, so that the . in Jr. is matched literally. Finally it joins the words with | and uses the resulting pattern to do the replace.
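As a minimal, self-contained sketch of the steps described above (the class name and the Highlight helper are hypothetical names chosen for illustration, not part of the original answer), the whole thing can be wrapped up like this:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

class HighlightDemo
{
    // Hypothetical helper: wraps every occurrence of any phrase in <span> tags.
    // Assumes phrases contains at least one non-empty entry.
    static string Highlight(string text, string[] phrases)
    {
        // Longest phrases first, so "Billy Jr." is tried before "Billy".
        // Regex.Escape makes characters such as '.' match literally.
        string pattern = string.Join("|",
            phrases.OrderByDescending(p => p.Length).Select(Regex.Escape));
        return Regex.Replace(text, pattern, "<span>$&</span>", RegexOptions.IgnoreCase);
    }

    static void Main()
    {
        string[] words = { "Billy", "Billy Jr.", "party" };
        string s = "<p>Billy and Billy Jr. are both coming to the party.</p>";
        Console.WriteLine(Highlight(s, words));
        // <p><span>Billy</span> and <span>Billy Jr.</span> are both coming to the <span>party</span>.</p>
    }
}
```

Because the text is scanned in a single pass, a phrase that has already been wrapped is never matched again, which is what the foreach loop in the question could not guarantee.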

Related

How can I split a regex into exact words?

I need a little help regarding Regular Expressions in C#
I have the following string
"[[Sender.Name]]\r[[Sender.AdditionalInfo]]\r[[Sender.Street]]\r[[Sender.ZipCode]] [[Sender.Location]]\r[[Sender.Country]]\r"
The string could also contain spaces and theoretically any other characters. So I really need to match the [[words]].
What I need is a text array like this
"[[Sender.Name]]",
"[[Sender.AdditionalInfo]]",
"[[Sender.Street]]",
// ... And so on.
I'm pretty sure that this is perfectly doable with:
var stringArray = Regex.Split(line, @"\[\[+\]\]")
I'm just too stupid to find the correct Regex for the Regex.Split() call.
Anyone here that can tell me the correct Regular Expression to use in my case?
As you can tell I'm not that experienced with RegEx :)
Why don't you split on "\r"?
You don't need a regex for that; the standard string method is enough:
string[] delimiters = { "\r" }; // "\r" is a carriage return; use @"\r" only if the text contains a literal backslash followed by r
string[] split = line.Split(delimiters, StringSplitOptions.None);
Do matching if you want to get the [[..]] block.
Regex rgx = new Regex(@"\[\[.*?\]\]");
foreach (Match m in rgx.Matches(input))
Console.WriteLine(m.Groups[0].Value);
The regex you are using (\[\[+\]\]) matches two or more literal [ characters followed by exactly two literal ] characters, so it never matches the text between the brackets.
A regex solution is to match runs of non-] characters inside the doubled [[ and ]] (I assume the text inside the brackets should not be empty) and convert the MatchCollection to a list or array (here is an example with a list):
var str = "[[Sender.Name]]\r[[Sender.AdditionalInfo]]\r[[Sender.Street]]\r[[Sender.ZipCode]] [[Sender.Location]]\r[[Sender.Country]]\r";
var rgx22 = new Regex(@"\[\[[^]]+?\]\]");
var res345 = rgx22.Matches(str).Cast<Match>().ToList();

Regular expression to find all words which starts with white space and ends with white space

I need to find words in a string that are preceded and followed by white space. I am having trouble matching the white space, although I did manage the case below, where each word starts and ends with ##. Any help with the white-space case would be great.
string input = "##12## ##13##";
foreach (Match match in Regex.Matches(input, @"##\b\S+?\b##"))
{
// the pattern has no capturing group, so show the whole match
MessageBox.Show(match.Value);
}
From MSDN doc:
// Define a regular expression for repeated words.
Regex rx = new Regex(@"\b(?<word>\w+)\s+(\k<word>)\b",
RegexOptions.Compiled | RegexOptions.IgnoreCase);
\s+(?=</)
is the expression you're after. It means one or more white-space characters followed by </.
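The same lookaround idea can be applied to the white-space case from the question. The following is only a sketch under the assumption that "starts and ends with white space" means a white-space character directly before and directly after the word; the class name and sample input are made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

class WhitespaceDelimitedWords
{
    static void Main()
    {
        string input = "first second third";
        // (?<=\s) requires white space just before the word,
        // (?=\s) requires white space just after it; neither is consumed.
        foreach (Match match in Regex.Matches(input, @"(?<=\s)\S+(?=\s)"))
        {
            Console.WriteLine(match.Value); // prints "second" only
        }
    }
}
```

Since the lookarounds do not consume the surrounding white space, two adjacent words can share the same space character. The first and last words are skipped here because they touch the start and end of the string.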
In my opinion it is better to use string.Split() instead of Regex:
var wordsArray = s.Split(new []{' '},StringSplitOptions.RemoveEmptyEntries);
It is better to avoid regex if you can achieve the same result more easily with standard string methods.
I can't tell exactly what you have in mind, but I hope this code helps:
string[] ha = input.Split(new[] { '#' }, StringSplitOptions.RemoveEmptyEntries);

C# creating a string which will be parsed, based on user input fails when they enter a tokenizing character

I know what is going on, but I was trying to make my .Split() ignore certain characters.
sample:
1|2|3|This is a string|type:1
The part "This is a string" is user input. The user could type the splitting character (| in this case), so I wanted to escape it as \|. It still seems to split on it. This is being done on the web, so I was thinking that a smart move might be to just JSON.encode(user_in) to get around it?
1|2|3| This is \|a string|type:1
It still splits on the escaped character because I didn't define it as a special case. How would I get around this issue?
You could use Regex.Split instead and then split on | not preceded by a \.
// -- regex for | not preceded by a \
string input = @"1|2|3|This is a string\|type:1";
string pattern = @"(?<!\\)[|]";
string[] substrings = Regex.Split(input, pattern);
foreach (string match in substrings)
{
Console.WriteLine("'{0}'", match);
}
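After splitting this way, the escaped delimiters are still present in the fields as \|. A possible follow-up, sketched here with a made-up class name and the sample line from the question, is to unescape each field once the split is done:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

class EscapedDelimiterDemo
{
    static void Main()
    {
        string input = @"1|2|3|This is \|a string|type:1";
        // Split on | only when it is not preceded by a backslash,
        // then turn each escaped \| back into a plain |.
        string[] fields = Regex.Split(input, @"(?<!\\)\|")
            .Select(field => field.Replace(@"\|", "|"))
            .ToArray();
        foreach (string field in fields)
        {
            Console.WriteLine("'{0}'", field);
        }
        // '1', '2', '3', 'This is |a string', 'type:1' (one per line)
    }
}
```

This simple scheme assumes that user input never needs a literal backslash directly in front of a delimiter; supporting that would require escaping the backslash itself as well.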
You can replace your delimiter with something special first, next split it and finally replace it back.
var initial = @"1|2|3|This is \| a string|type:1";
var modified = initial.Replace(@"\|", "###");
IEnumerable<string> result = modified.Split('|');
result = result.Select(i => i.Replace("###", @"\|"));

Searching for a RegEx to split a text into its words

I am searching for a regular expression to split a text into its words.
I have tested
Regex.Split(text, @"\s+")
But for an input such as
this (is a) text. and
this gives me
this
(is
a)
text.
and
But I am looking for a solution that gives me only the words, without the (, ), ., etc.
It should also split a text like
end.begin
in two words.
Try this:
Regex.Split(text, @"\W+")
\W is the negation of \w: \w matches a word character (a letter, a digit, or an underscore), so \W matches anything else.
You're probably better off matching the words rather than splitting.
If you use Split (with \W as Regexident suggested), then you could get an extra string at the beginning and end. For example, the input string (a b) would give you four outputs: "", "a", "b", and another "", because you're using the ( and ) as separators.
What you probably want to do is just match the words. You can do that like this:
Regex.Matches(text, "\\w+").Cast<Match>().Select(match => match.Value)
Then you'll get just the words, and no extra empty strings at the beginning and end.
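To make the difference concrete, here is a small runnable sketch of the matching approach (the class and the Words helper are hypothetical names), applied to the inputs mentioned in the question and in this answer:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

class WordMatchDemo
{
    // Hypothetical helper: returns every run of word characters in the text.
    static string[] Words(string text)
    {
        return Regex.Matches(text, @"\w+").Cast<Match>().Select(m => m.Value).ToArray();
    }

    static void Main()
    {
        Console.WriteLine(string.Join(",", Words("this (is a) text. and"))); // this,is,a,text,and
        Console.WriteLine(string.Join(",", Words("end.begin")));             // end,begin
        Console.WriteLine(Words("(a b)").Length);                            // 2, no empty strings
    }
}
```

Note that \w also matches digits and the underscore, so tokens such as abc_1 are returned as a single word.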
You can do:
var text = "this (is a) text. and";
// to replace unwanted characters with space
text = System.Text.RegularExpressions.Regex.Replace(text, "[(),.]", " ");
// to split the text with SPACE delimiter
var splitted = text.Split(null as char[], StringSplitOptions.RemoveEmptyEntries);
foreach (var token in splitted)
{
Console.WriteLine(token);
}
See this Demo

C# Parsing Text Within Quotes

I'm developing a simple little search mechanism and I want to allow the user to search for chunks of text with spaces. For example, a user can search for the name of a person:
Name: John Smith
I then "John Smith".Split(' ') into an array of two elements, {"John","Smith"}. I then return all of the records that match "John" AND "Smith" first followed by records that match either "John" OR "Smith." I then return no records for no matches. This isn't a complicated scenario and I have this part working.
I'd now like to be able to allow the user to ONLY return records that match "John Smith"
I'd like to use a basic quote syntax for searching. So if a user wants to search for "John Smith" OR Pocahontas they would enter: "John Smith" Pocahontas. The order of terms is absolutely irrelevant; "John Smith" does not receive priority over Pocahontas because he comes first in the list.
I have two main trains of thought on how I should parse the input.
A) Using regular expression then parsing stuff (IndexOf, Split)
B) Using only the parsing methods
I think a logical point of action would be to find the stuff in quotes; then remove it from the original string and insert it into a separate list. Then all the stuff left over from the original string could be split on the space and inserted into that separate list. If there is either 1 quote or an odd number, it is simply removed from the list.
How do I find the matches from within a regex? I know about Regex.Replace, but how would I iterate through the matches and insert them into a list? I know there is some neat way to do this using the MatchEvaluator delegate and LINQ, but I know basically nothing about regex in C#.
EDIT: Came back to this tab without refreshing and didn't realize this question was already answered... accepted answer is better.
I think pulling out the stuff in quotes first with regex is a good idea. Maybe something like this:
String sampleInput = "\"John Smith\" Pocahontas Bambi \"Jane Doe\" Aladin";
//Create regex pattern: a quote, one or more non-quote characters (captured), a quote
Regex regex = new Regex("\"([^\"]+)\"");
List<string> searches = new List<string>();
//Loop through all matches from regex
foreach (Match match in regex.Matches(sampleInput))
{
//add the match value for the 2nd group to the list
//(1st group is the entire match)
//(2nd group is the first parenthesis group in the defined regex pattern
// which in this case is the text inside the quotes)
searches.Add(match.Groups[1].Value);
}
//remove the matches from the input
sampleInput = regex.Replace(sampleInput, String.Empty);
//split the remaining input and add the result to our searches list
searches.AddRange(sampleInput.Split(new char[] {' '}, StringSplitOptions.RemoveEmptyEntries));
I needed the same functionality as Shawn but I didn't want to use regex. Here is a simple solution I came up with that uses Split() instead of regex, for anyone else needing this functionality.
This works because the Split method, by default, will create empty entries in the array for consecutive search values in the source string. If we split on the quote character, the result is an array where the even-indexed entries are individual words and the odd-indexed entries are the quoted phrases.
Example:
"John Smith" Pocahontas
Results in
item(0) = (empty string)
item(1) = John Smith
item(2) = Pocahontas
And
1 2 "3 4" 5 "6 7" "8 9"
Results in
item(0) = 1 2
item(1) = 3 4
item(2) = 5
item(3) = 6 7
item(4) = (empty string)
item(5) = 8 9
Note that an unmatched quote will result in a phrase from the last quote to the end of the input string.
public static List<string> QueryToTerms(string query)
{
List<string> Result = new List<string>();
// split on the quote token
string[] QuoteTerms = query.Split('"');
// switch to denote if the current loop is processing words or a phrase
bool WordTerms = true;
foreach (string Item in QuoteTerms)
{
if (!string.IsNullOrWhiteSpace(Item))
if (WordTerms)
{
// Item contains words. parse them and ignore empty entries.
string[] WTerms = Item.Split(new string[] { " " }, StringSplitOptions.RemoveEmptyEntries);
foreach (string WTerm in WTerms)
Result.Add(WTerm);
}
else
// Item is a phrase.
Result.Add(Item);
// Alternate between words and phrases.
WordTerms = !WordTerms;
}
return Result;
}
Use a regex like this:
string input = "\"John Smith\" Pocahontas";
Regex rx = new Regex(@"(?<="")[^""]+(?="")|[^\s""]\S*");
for (Match match = rx.Match(input); match.Success; match = match.NextMatch()) {
// use match.Value here, it contains the string to be searched
}
