Highlighting whole word in HTML string using C# regexp


I wrote a method that highlights keywords in an HTML string. It returns the updated string and a list of the matched keywords.
I would like to match the word if it appears as a whole word or with dashes.
But in case it appears with dashes, the word including the dashes is highlighted and returned.
For example, if the word is locks and the HTML contains "He -locks- the door", then the dashes around the word are also highlighted:
He <span style="background-color:yellow">-locks-</span> the door.
Instead of:
He -<span style="background-color:yellow">locks</span>- the door.
In addition, the returned list contains -locks- instead of locks.
What can I do to get my expected result?
Here is my code:
private static List<string> FindKeywords(IEnumerable<string> words, bool bHighlight, ref string text)
{
    HashSet<String> matchingKeywords = new HashSet<string>(new CaseInsensitiveComparer());
    string allWords = "\\b(-)?(" + words.Aggregate((list, word) => list + "|" + word) + ")(-)?\\b";
    Regex regex = new Regex(allWords, RegexOptions.Compiled | RegexOptions.IgnoreCase);
    foreach (Match match in regex.Matches(text))
    {
        matchingKeywords.Add(match.Value);
    }
    if (bHighlight)
    {
        text = regex.Replace(text, string.Format("<span style=\"background-color:yellow\">{0}</span>", "$0"));
    }
    return matchingKeywords.ToList();
}

You need to use captured .Groups[2].Value instead of Match.Value because your regex has 3 capturing groups, and the second one contains the keyword that you highlight:
foreach (Match match in regex.Matches(text))
{
    matchingKeywords.Add(match.Groups[2].Value);
}
if (bHighlight)
{
    text = regex.Replace(text, string.Format("$1<span style=\"background-color:yellow\">{0}</span>$3", "$2"));
}
match.Groups[2].Value is used in the foreach and then $2 is the backreference to the keyword captured in the regex.Replace replacement string. $1 and $3 are the optional hyphens around the highlighted word (captured with (-)?).
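Putting the two changes together, the whole method could look like the sketch below. The class name KeywordHighlighter, the use of StringComparer.OrdinalIgnoreCase in place of the question's comparer, and the Regex.Escape call on each keyword are assumptions added for illustration; they are not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeywordHighlighter
{
    // Returns the distinct keywords found (ignoring case). When highlight is
    // true, each keyword is wrapped in a span while its hyphens stay outside.
    public static List<string> FindKeywords(IEnumerable<string> words, bool highlight, ref string text)
    {
        // Assumption: a case-insensitive set is what the original comparer was for.
        var matchingKeywords = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

        // Assumption: keywords are literal text, so escape regex metacharacters.
        string alternation = string.Join("|", words.Select(w => Regex.Escape(w)));
        var regex = new Regex(@"\b(-)?(" + alternation + @")(-)?\b", RegexOptions.IgnoreCase);

        foreach (Match match in regex.Matches(text))
        {
            // Group 2 holds the keyword only; groups 1 and 3 hold the optional hyphens.
            matchingKeywords.Add(match.Groups[2].Value);
        }

        if (highlight)
        {
            text = regex.Replace(text, "$1<span style=\"background-color:yellow\">$2</span>$3");
        }

        return matchingKeywords.ToList();
    }
}
```

Called with the keyword locks and the text "He -locks- the door", this returns a list containing locks and leaves the hyphens outside the span.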

Related

Remove Adjacent Space near a Special Character using regex

Using a regex, I want to remove the whitespace adjacent to a replacement character.
replacementCharacter = '-'
this._adjacentSpace = new Regex($@"\s*([\{replacementCharacter}])\s*");
MatchCollection replaceCharacterMatch = this._adjacentSpace.Matches(extractedText);
foreach (Match replaceCharacter in replaceCharacterMatch)
{
    if (replaceCharacter.Success)
    {
        cleanedText = extractedText.Replace(replaceCharacter.Value, replaceCharacter.Value.Trim());
    }
}
extractedText = - whi, - ch
cleanedText = -whi, -ch
Expected result: cleanedText = -whi,-ch

You can use
var extractedText = "- whi, - ch";
var replacementCharacter = "-";
var _adjacentSpace = new Regex($@"\s*({Regex.Escape(replacementCharacter)})\s*");
var cleanedText = _adjacentSpace.Replace(extractedText, "$1");
Console.WriteLine(cleanedText); // => -whi,-ch
NOTE:
replacementCharacter is of type string in the code above.
$@"\s*({Regex.Escape(replacementCharacter)})\s*" creates a regex like \s*(-)\s*. Regex.Escape() escapes any regex-special character (such as + or a parenthesis) so that it can be used safely in a pattern. The whole regex matches the replacementCharacter, captured into Group 1 by the parentheses, together with zero or more whitespace characters on either side.
There is no need to call Regex.Matches first: Regex.Replace replaces all matches if there are any.
_adjacentSpace is the Regex object; to replace, just call the .Replace() method on that instance.
The replacement $1 is a backreference to the Group 1 value, here the - character.
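To illustrate why the Regex.Escape call matters, here is a small sketch that wraps the replacement in a helper and can be tried with a character that is special in regex syntax, such as +. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SpaceTrimmer
{
    // Removes whitespace on both sides of every occurrence of replacementCharacter.
    public static string RemoveAdjacentSpace(string text, string replacementCharacter)
    {
        // Regex.Escape turns "+" into "\+", so the pattern stays valid
        // whatever character is passed in.
        var adjacentSpace = new Regex(@"\s*(" + Regex.Escape(replacementCharacter) + @")\s*");
        return adjacentSpace.Replace(text, "$1");
    }
}
```

RemoveAdjacentSpace("- whi, - ch", "-") gives -whi,-ch, and RemoveAdjacentSpace("a + b", "+") gives a+b.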

Find hashtags in string

I am working on a Xamarin.Forms PCL project in C# and would like to detect all the hashtags.
I tried splitting at spaces and checking whether each word begins with a #, but if the post contains two consecutive spaces, like "Hello #World Test", the double space is lost.
string body = "Example string with a #hashtag in it";
string newbody = "";
foreach (var word in body.Split(' '))
{
    if (word.StartsWith("#"))
        newbody += "[" + word + "]";
    else
        newbody += word;
}
Goal output:
Example string with a [#hashtag] in it
I also want a hashtag to contain only A-Z, a-z, 0-9 and _, stopping at any other character:
Test #H3ll0_W0rld$%Test => Test [#H3ll0_W0rld]$%Test
Other Stack questions try to detect the hashtag and extract it. I would like to work with it and put it back in the string without losing anything that methods such as splitting on certain characters would lose.

You can use Regex.Replace with the pattern #\w+ and the replacement $&
Explanation
# matches the character # literally.
\w matches any word character. In .NET this covers Unicode letters and digits as well as [a-zA-Z0-9_]; use [A-Za-z0-9_] or RegexOptions.ECMAScript to restrict it to ASCII.
+ is a quantifier that matches between one and unlimited times, as many times as possible, giving back as needed (greedy).
$& includes a copy of the entire match in the replacement string.
Example
var input = "asdads sdfdsf #burgers, #rabbits dsfsdfds #sdf #dfgdfg";
var regex = new Regex(@"#\w+");
var matches = regex.Matches(input);
foreach (Match match in matches)
{
    Console.WriteLine(match.Value);
}
or
var result = regex.Replace(input, "[$&]" );
Console.WriteLine(result);
Output
#burgers
#rabbits
#sdf
#dfgdfg
asdads sdfdsf [#burgers], [#rabbits] dsfsdfds [#sdf] [#dfgdfg]
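Since the question asks for only A-Z, a-z, 0-9 and _ inside a hashtag, a variant with an explicit ASCII character class may be closer to the goal. This is a minimal sketch; the class and method names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HashtagMarker
{
    // Wraps every hashtag in square brackets and leaves the rest of the
    // string, including repeated spaces, untouched.
    public static string MarkHashtags(string body)
    {
        // [A-Za-z0-9_] is used instead of \w to keep the match ASCII-only.
        return Regex.Replace(body, "#[A-Za-z0-9_]+", "[$&]");
    }
}
```

MarkHashtags("Test #H3ll0_W0rld$%Test") gives Test [#H3ll0_W0rld]$%Test, matching the goal output in the question.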

Use a regular expression: \#\w*
string pattern = @"\#\w*"; // verbatim string, so the backslashes reach the regex engine
Regex rgx = new Regex(pattern, RegexOptions.IgnoreCase);
MatchCollection matches = rgx.Matches(input);

Regex MatchEvaluator doesn't work with words contains "_" underscore

I am trying to match regex results and format the output. I have a list of words, e.g.:
var resultArray = new List<string> {"new", "new_"}; // notice the word with the underscore
But when I search a sentence like this:
New Law_Book_with_New_Cover
it matches the first word "New" but not the middle one, "New_". Here is my code:
if (resultArray.Count > 0)
{
    string regex = "\\b(?:" + String.Join("|", resultArray.ToArray()) + ")\\b";
    MatchEvaluator myEvaluator = new MatchEvaluator(GetHighlightMarkup);
    return Regex.Replace(result, regex, myEvaluator, RegexOptions.Compiled | RegexOptions.Multiline | RegexOptions.IgnoreCase);
}
private static string GetHighlightMarkup(Match m)
{
    return string.Format("<span class=\"focus\">{0}</span>", m.Value);
}
And yes, I did try escaping the word as "\New_", but still no luck.
What am I missing?

The underscore is a word character, so \b finds no boundary between _ and an adjacent letter: in Law_Book_with_New_Cover there is no \b before New or after New_. It seems you need to match your items only if they are not enclosed with letters.
You may replace the word boundaries in your regex with lookarounds:
string regex = @"(?<!\p{L})(?:" + String.Join("|", resultArray.ToArray()) + @")(?!\p{L})";
where \p{L} matches any letter, (?<!\p{L}) requires the absence of a letter before the match, and (?!\p{L}) disallows a letter after the match.
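As a rough sketch, the lookaround pattern can be packaged like this. The class name, the Regex.Escape call and the longest-first ordering of the alternatives are assumptions added for illustration; the ordering only matters when one term is a prefix of another.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class FocusHighlighter
{
    // Wraps each term in a span when it is not directly preceded or
    // followed by a letter.
    public static string Highlight(string input, IEnumerable<string> terms)
    {
        // Assumption: try longer terms first so "new_" can win over "new".
        var alternatives = terms
            .OrderByDescending(t => t.Length)
            .Select(t => Regex.Escape(t));

        string pattern = @"(?<!\p{L})(?:" + string.Join("|", alternatives) + @")(?!\p{L})";

        return Regex.Replace(
            input,
            pattern,
            m => "<span class=\"focus\">" + m.Value + "</span>",
            RegexOptions.IgnoreCase);
    }
}
```

For the input New Law_Book_with_New_Cover and the terms new and new_, both occurrences of New are wrapped. The middle hit is New rather than New_, because New_ is followed by the letter C and the lookahead rejects it.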

Dot word pattern matching

I want to create a regular expression to match a word that begins with a period. The word(s) can exist N times in a string. I want to ensure that the word comes up whether it's at the beginning of a line, the end of a line or somewhere in the middle. The latter part is what I'm having difficulty with.
Here is where I am at so far.
const string pattern = @"(^|(.* ))(?<slickText>\.[a-zA-Z0-9]*)( .*|$)";
public static MatchCollection Find(string input)
{
    Regex regex = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Multiline);
    MatchCollection collection = regex.Matches(input);
    return collection;
}
My test pattern finds .lee and .good. My test pattern fails to find .bruce:
static void Main()
{
    MatchCollection results = ClassName.Find("a short stump .bruce\r\nand .lee a small tree\r\n.good roots");
    foreach (Match item in results)
    {
        GroupCollection groups = item.Groups;
        Console.WriteLine("{0} ", groups["slickText"].Value);
    }
    System.Diagnostics.Debug.Assert(results.Count > 0);
}

Maybe all you need is the pattern \.\w+
Test (.Dump() is LINQPad's output helper):
var s = "a short stump .bruce\r\nand .lee a small tree\r\n.good roots";
Regex.Matches(s, @"\.\w+").Dump();
Result: three matches, .bruce, .lee and .good.
Note:
If you don't want to find .foo in some.foo (because there's no whitespace between some and .foo), you can use (?<=\W|^)\.\w+ instead.

Bizarrely enough, it seems that with RegexOptions.Multiline, ^ and $ additionally match only at \n, not at \r\n.
Thus you get .good because it is preceded by \n, after which ^ matches, but you don't get .bruce because it is followed by \r, before which $ does not match.
You could do a .Replace("\r", "") on the input, or rewrite your expression to take individual lines of input.
Edit: Or replace $ with \r?$ in your pattern to explicitly include the \r; thanks to SvenS for the suggestion.
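The \r?$ suggestion can be tried with a sketch like the following, which keeps the question's pattern and changes only the end-of-line part. The class name and the List<string> return type are choices made here for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class DotWordFinder
{
    // Same pattern as in the question, with $ replaced by \r?$ so that a
    // Windows line ending (\r\n) is accepted after the dot word.
    private const string Pattern = @"(^|(.* ))(?<slickText>\.[a-zA-Z0-9]*)( .*|\r?$)";

    public static List<string> Find(string input)
    {
        var regex = new Regex(Pattern, RegexOptions.IgnoreCase | RegexOptions.Multiline);
        return regex.Matches(input)
            .Cast<Match>()
            .Select(m => m.Groups["slickText"].Value)
            .ToList();
    }
}
```

With the test string from the question, Find returns .bruce, .lee and .good.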

In your RegEx, a word has to be terminated by a space, but bruce is terminated by \r instead.

I would give this regex a go:
(?:.*?(\.[A-Za-z]+(?:\b|.\s)).*?)+
And change the RegexOptions from Multiline to Singleline - in this mode dot matches all characters including newline.

Find all words without figures using RegEx

I found this code to get all the words of a string:
static string[] GetWords(string input)
{
    MatchCollection matches = Regex.Matches(input, @"\b[\w']*\b");
    var words = from m in matches.Cast<Match>()
                where !string.IsNullOrEmpty(m.Value)
                select TrimSuffix(m.Value);
    return words.ToArray();
}
static string TrimSuffix(string word)
{
    int apostrapheLocation = word.IndexOf('\'');
    if (apostrapheLocation != -1)
    {
        word = word.Substring(0, apostrapheLocation);
    }
    return word;
}
Please explain the code.
How can I get words without figures?

2. How can I get words without figures?
You'll have to replace \w with [A-Za-z],
so that your regex becomes @"\b[A-Za-z']*\b"
And then you'll have to think about TrimSuffix(). The regex allows apostrophes, but TrimSuffix() keeps only the part to the left of the apostrophe. So "it's" will become "it".
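A minimal sketch of that change is shown below. It assumes that "without figures" means skipping any token that contains a digit, and it uses + instead of * so that no empty matches need to be filtered out. TrimSuffix is left out here, so apostrophes are kept; the class name is hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordExtractor
{
    // Returns the words made up of ASCII letters and apostrophes only.
    // A token such as "abc123" is skipped entirely, because no word
    // boundary exists between "abc" and "123".
    public static string[] GetWords(string input)
    {
        return Regex.Matches(input, @"\b[A-Za-z']+\b")
            .Cast<Match>()
            .Select(m => m.Value)
            .ToArray();
    }
}
```

For the input "it's 42 days, abc123 def" this returns it's, days and def.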

In
MatchCollection matches = Regex.Matches(input, @"\b[\w']*\b");
the code uses a regex that looks for any word. \b is a word boundary, and \w is the shorthand class for word characters: letters (with or without accents), digits and the underscore. The ' is simply added to the character class alongside them. So the pattern finds the beginning and the end of each word and selects what lies between.
then
var words = from m in matches.Cast<Match>()
            where !string.IsNullOrEmpty(m.Value)
            select TrimSuffix(m.Value);
is LINQ query syntax, which lets you write SQL-like queries inside your code. This code takes every match from the regex, skips the empty ones, and passes each remaining value through TrimSuffix. It is also where you can add your check for figures.
And this:
static string TrimSuffix(string word)
{
    int apostrapheLocation = word.IndexOf('\'');
    if (apostrapheLocation != -1)
    {
        word = word.Substring(0, apostrapheLocation);
    }
    return word;
}
cuts each word at its first apostrophe and keeps only the part before it.
For example, for the word don't it returns only don.
