Highlight search terms in a string - c#

I have written a function that works with my site's search functionality. When the user searches for a word, I perform a replace on the returned search content to take any word the user entered and wrap it in span tags with a custom class, which basically bolds the word on the page. After overcoming my first roadblock of having to incorporate case-insensitive replacements, I'm now stuck in another predicament. The word on the page is replaced with the user's casing, which looks odd because the content is largely legal text and acronyms. If a user searches for "rpC 178", the "RPC 178" in the content is displayed in bold but with the user's casing, "rpC 178". My first thought was to split the content on spaces and keep a temporary copy of each word before it is replaced in order to preserve its current case, but some of these content blocks can be upwards of 4000 words, so that seems inefficient. Am I going about this the wrong way?
Here is my current code:
public static String HighlightWords(String content, String className, String searchTerms)
{
string[] terms = new string[] { };
if (!string.IsNullOrWhiteSpace(searchTerms))
{
terms = searchTerms.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
}
if (terms == null || terms.Length == 0)
{
return content;
}
var optimizedContent = new StringBuilder(content);
var startHtml = string.Format("<span class=\"{0}\">", className);
var endHtml = "</span>";
string result = string.Empty;
foreach (var term in terms)
{
result = Regex.Replace(optimizedContent.ToString(), term, string.Format("{0}" + term + "{1}", startHtml, endHtml), RegexOptions.Compiled | RegexOptions.IgnoreCase);
}
return result;
}

You can use the other overload of the Regex.Replace method that accepts a MatchEvaluator delegate. Here you pass a method that gets the actual text found as a parameter and can dynamically build the string to use as a replacement.
Sample:
string output = Regex.Replace(input, term,
match => startHtml + match.Value + endHtml,
RegexOptions.Compiled | RegexOptions.IgnoreCase);
Note that the lambda notation with the => symbol does not work with versions of C# older than 3.0. In that case you have to use the longer anonymous-method form:
string output = Regex.Replace(input, term, new MatchEvaluator(delegate(Match match)
{
return startHtml + match.Value + endHtml;
}),
RegexOptions.Compiled | RegexOptions.IgnoreCase);
You can also improve your code further, because you do not need a foreach loop over all the specified search terms. Just build one regular expression that contains all the terms to look for (joined with |) and use that for searching.
Remember to use Regex.Escape() to escape the data entered by the user before using it for searching with the Regex class, so that everything works as expected when the user enters characters that have a special meaning in regular expressions.
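Putting the three suggestions above together (MatchEvaluator, a single pattern, Regex.Escape) might look like the sketch below. The class name HighlightSketch is made up for illustration, and it assumes the content is plain text rather than markup that could itself contain the search terms.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class HighlightSketch
{
    // Wraps every occurrence of any search term in a span, keeping the casing
    // found in the content rather than the casing the user typed.
    public static string HighlightWords(string content, string className, string searchTerms)
    {
        if (string.IsNullOrEmpty(content) || string.IsNullOrWhiteSpace(searchTerms))
        {
            return content;
        }

        // Escape each term so characters such as '.' or '(' are matched literally.
        string[] escapedTerms = searchTerms
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(t => Regex.Escape(t))
            .ToArray();

        // One alternation for all terms: term1|term2|...
        string pattern = string.Join("|", escapedTerms);
        string startHtml = string.Format("<span class=\"{0}\">", className);
        const string endHtml = "</span>";

        // match.Value is the text as it appears in the content.
        return Regex.Replace(
            content,
            pattern,
            match => startHtml + match.Value + endHtml,
            RegexOptions.IgnoreCase);
    }
}
```

Because all terms are replaced in one pass, the inserted span markup is never searched again, so a search term like "span" cannot corrupt the tags added for an earlier term.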

Parsing mathematical expressions in C#

As a project, I want to write a parser for mathematical expressions in C#. I know there are libraries for this, but want to create my own to learn about this topic.
As an example, I have the expression
min(3,4) + 2 - abs(-4.6)
I then create tokens from this string by specifying regular expressions and going through the user's expression, trying to match one of the regexes. This is done from front to back:
private static List<string> Tokenize(string expression)
{
List<string> result = new List<string>();
List<string> tokens = new List<string>();
tokens.Add("^\\(");// matches opening bracket
tokens.Add("^([\\d.\\d]+)"); // matches floating point numbers
tokens.Add("^[&|<=>!]+"); // matches operators and other special characters
tokens.Add("^[\\w]+"); // matches words and integers
tokens.Add("^[,]"); // matches ,
tokens.Add("^[\\)]"); // matches closing bracket
while (0 != expression.Length)
{
bool foundMatch = false;
foreach (string token in tokens)
{
Match match = Regex.Match(expression, token);
if (false == match.Success)
{
continue;
}
result.Add(match.Value);
expression = Regex.Replace(expression, token, "");
foundMatch = true;
break;
}
if (false == foundMatch)
{
break;
}
}
return result;
}
This works quite well. Now I want the user to be able to enter strings in the expression. I found a question about this at Regex tokenize issue; however, the answers there provide regexes that match the text anywhere in the expression. I need the regex to match only the first occurrence at the front of the expression so I can keep the order of the tokens.
As an example see this:
5 + " is smaller than " + 10
should give me the tokens
5 + " is smaller than " + 10
If possible I would also like to support escape characters so the user can use the character " in strings, so that "This is an apostrophe \" " gives me the token "This is an apostrophe " "
The answer from Wiktor Stribiżew at that question looked really good, but I couldn't modify it so it only matches at the beginning and only one word. Help is appreciated!
Funny that you reference that question. I actually adapted my answer there (yet again) to work for you here ;)
Here's a fiddle showing the solution.
The regex is
(?!\+)(?:"((?:\\"|[^"])*)"?)
I changed the code to use capture groups, which makes it simple to leave out the surrounding quotes. The loop also removes the + sign separating the tokens.
Regards
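Since the fiddle is not reproduced here, the following is a rough sketch of one way to consume string tokens strictly from the front. It is not the regex from the answer above: it relies on \G together with Regex.Match(input, startat) so every match has to begin exactly where the previous one ended. The class name, the group names str and other, and the catch-all "other" pattern are assumptions for illustration; in the question's tokenizer the "other" branch would be replaced by the existing token patterns.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class StringTokenSketch
{
    // \G anchors each match at the position passed to Match, so tokens are
    // consumed front to back. Hypothetical pattern for illustration:
    //   str   = a quoted string in which \" stands for a literal quote
    //   other = any run of characters that is neither whitespace nor a quote
    private static readonly Regex Token = new Regex(
        @"\G\s*(?:""(?<str>(?:\\.|[^""\\])*)""|(?<other>[^\s""]+))");

    public static List<string> Tokenize(string expression)
    {
        var tokens = new List<string>();
        int position = 0;
        while (position < expression.Length)
        {
            Match match = Token.Match(expression, position);
            if (!match.Success)
            {
                break; // nothing recognizable here, or only trailing whitespace
            }

            if (match.Groups["str"].Success)
            {
                // Drop the surrounding quotes and turn \" back into ".
                tokens.Add(match.Groups["str"].Value.Replace("\\\"", "\""));
            }
            else
            {
                tokens.Add(match.Groups["other"].Value);
            }

            position = match.Index + match.Length;
        }
        return tokens;
    }
}
```

With the example from the question, this yields the five tokens 5, +, the string with its spaces but without quotes, +, and 10, in that order.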

c# extracting a certain value within a string

I'm trying to remove a certain bit of text within a string.
Say the string I have contains html elements, like paragraph tags, I created some sort of tokens that will be identified with "{" at the beginning and "}" at the end.
So essentially the string I have would look like this:
text = "<p>{token}</p><p> text goes here {token3}</p>"
I'm wondering whether there is a way to extract all the words, including the "{}", from the string using C# code.
Each token could be different from the next, which is why I must use "{" and "}" to identify them.
At the moment I've got this code:
var newWord = text.Contains("{") && word.Contains("}")
Something like
var r = new Regex("({.*?})");
foreach(var match in r.Matches(myString)) ...
The ? makes the quantifier non-greedy. If you omit it, you'll simply get everything between the first { and the last }.
Alternatively, you may also use this:
var result = new List<string>();
var index = text.IndexOf("{");
while (index != -1)
{
var end = text.IndexOf("}", index);
if (end == -1) break; // unmatched "{": stop instead of throwing in Substring
result.Add(text.Substring(index, end - index + 1));
index = text.IndexOf("{", index + 1);
}
I would just use a regex for this:
Regex reg = new Regex("{.*?}");
var results = reg.Matches(text);
The regex searches for any characters between { and }.
The .*? means match any character but in a non greedy way. So it will search for the shortest possible string between braces.
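To make the greedy versus non-greedy difference described in the regex answers concrete, here is a small sketch. The helper name Extract and its class are made up for illustration.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BraceTokenSketch
{
    // Returns every match of the given pattern, in the order it appears.
    public static List<string> Extract(string text, string pattern)
    {
        var tokens = new List<string>();
        foreach (Match match in Regex.Matches(text, pattern))
        {
            tokens.Add(match.Value);
        }
        return tokens;
    }
}
```

For the string from the question, Extract(text, @"\{.*?\}") returns {token} and {token3}, while the greedy Extract(text, @"\{.*\}") returns a single match running from the first { to the last }.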

A string replace function with support of custom wildcards and escaping these wildcards in C#

I need to write a string replace function with custom wildcards support. I also should be able to escape these wildcards. I currently have a wildcard class with Usage, Value and Escape properties.
So let's say I have a global list called Wildcards. Wildcards has only one member added here:
Wildcards.Add(new Wildcard
{
Usage = @"\Break",
Value = Environment.NewLine,
Escape = @"\\Break"
});
So I need a CustomReplace method to do the trick. I should replace the specified parameter in a given string with another one just like the string.Replace. The only difference here that it must use my custom wildcards.
string test = CustomReplace("Hi there! What's up?", "! ", "!\\Break");
// Value of the test variable should be: "Hi there!\r\nWhat's up?"
// Because \Break is specified in a custom wildcard in Wildcards
// But if I use the value of the wildcard's Escape member,
// it should be replaced with the value of Usage member.
test = CustomReplace("Hi there! What's up?", "! ", "!\\\\Break");
// Value of the test variable should be: "Hi there!\\BreakWhat's up?"
My current method doesn't support escape strings.
It also can't be good when it comes to performance since I call string.Replace two times and each one searches the whole string, I guess.
// My current method. Has no support for escape strings.
string CustomReplace(string text, string oldValue, string newValue)
{
string done = text.Replace(oldValue, newValue);
foreach (Wildcard wildcard in Wildcards)
{
// Doing this:
// done = done.Replace(wildcard.Escape, wildcard.Usage);
// ...would cause trouble when Escape contains Usage.
done = done.Replace(wildcard.Usage, wildcard.Value);
}
return done;
}
So, do I have to write a replace method that searches the string char by char, with logic to find and separate both Usage and Escape values, and then replaces Escape with Usage while replacing Usage with another given string?
Or do you know an already written one?
Can I use regular expressions in this scenario?
If I can, how? (Have no experience in this, a pattern would be nice)
If I do, would it be faster or slower than char by char searching?
Sorry for the long post, I tried to keep it clear and sorry for any typos and such; it's not my primary language. Thanks in advance.
You can try this:
public string CustomReplace(string text, string oldValue, string newValue)
{
string done = text.Replace(oldValue, newValue);
var builder = new StringBuilder();
foreach (var wildcard in Wildcards)
{
builder.AppendFormat("({0}|{1})|", Regex.Escape(wildcard.Usage),
Regex.Escape(wildcard.Escape));
}
builder.Length = builder.Length - 1; // Remove the last '|' character
return Regex.Replace(done, builder.ToString(), WildcardEvaluator);
}
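private string WildcardEvaluator(Match match)
{
// A Usage match becomes the wildcard's Value.
var wildcard = Wildcards.Find(w => w.Usage == match.Value);
if (wildcard != null)
return wildcard.Value;
// An Escape match becomes the literal Usage text, as the question asks.
var escaped = Wildcards.Find(w => w.Escape == match.Value);
if (escaped != null)
return escaped.Usage;
return match.Value;
}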
I think this is the easiest and fastest solution as there is only one Replace method call for all wildcards.
So if you are happy to just use Regex to fulfil your needs, then you should check out this link. It has some great info on using regular expressions in .NET. The website also has loads of examples on how to construct Regex patterns for many different needs.
A basic example of a Replace on a string with wildcards might look like this...
string input = "my first regex replace";
string result = System.Text.RegularExpressions.Regex.Replace(input, "rep...e", "result");
//result is now "my first regex result"
Notice how the second argument of the Replace function takes a regex pattern string. In this case the dots act as wildcard characters; each one basically means "match any single character".
Hopefully this will help you get what you need.
If you define a pattern for both your wildcard and your escape method, you can create a Regex which will find all the wildcards in your text. You can then use a MatchEvaluator to replace them.
class Program
{
static Dictionary<string, string> replacements = new Dictionary<string, string>();
static void Main(string[] args)
{
replacements.Add("\\Break", Environment.NewLine);
string template = @"This is an \\Break escaped newline and this should \Break contain a newline.";
// (?<=($|[^\\])(\\\\){0,}) will handle double escaped items
string outcome = Regex.Replace(template, @"(?<=($|[^\\])(\\\\){0,})\\\w+\b", ReplaceMethod);
}
public static string ReplaceMethod(Match m)
{
string replacement = null;
if (replacements.TryGetValue(m.Value, out replacement))
{
return replacement;
}
else
{
//return string.Empty?
//throw new FormatException()?
return m.Value;
}
}
}

Increasing Regex Efficiency

I have about 100k Outlook mail items that have about 500-600 chars per Body. I have a list of 580 keywords that must search through each body, then append the words at the bottom.
I believe I've increased the efficiency of the majority of the function, but it still takes a lot of time. Even for 100 emails it takes about 4 seconds.
I run two functions for each keyword list (290 keywords each list).
public List<string> Keyword_Search(HtmlNode nSearch)
{
var wordFound = new List<string>();
foreach (string currWord in _keywordList)
{
bool isMatch = Regex.IsMatch(nSearch.InnerHtml, "\\b" + @currWord + "\\b",
RegexOptions.IgnoreCase);
if (isMatch)
{
wordFound.Add(currWord);
}
}
return wordFound;
}
Is there any way I can increase the efficiency of this function?
The other thing that might be slowing it down is that I use HTML Agility Pack to navigate through some nodes and pull out the body (nSearch.InnerHtml). The _keywordList is a List item, and not an array.
I assume that the COM call nSearch.InnerHtml is pretty slow and you repeat the call for every single word that you are checking. You can simply cache the result of the call:
public List<string> Keyword_Search(HtmlNode nSearch)
{
var wordFound = new List<string>();
// cache inner HTML
string innerHtml = nSearch.InnerHtml;
foreach (string currWord in _keywordList)
{
bool isMatch = Regex.IsMatch(innerHtml, "\\b" + @currWord + "\\b",
RegexOptions.IgnoreCase);
if (isMatch)
{
wordFound.Add(currWord);
}
}
return wordFound;
}
Another optimization would be the one suggested by Jeff Yates. E.g. by using a single pattern:
string pattern = @"(\b(?:" + string.Join("|", _keywordList) + @")\b)";
I don't think this is a job for regular expressions. You might be better off searching each message word by word and checking each word against your word list. With the approach you have, you're searching each message n times where n is the number of words you want to find - it's no wonder that it takes a while.
Most of the time comes from matches that fail, so you want to minimize failures.
If the search keywords are not frequent, you can test for all of them at the same time (with the regexp \b(aaa|bbb|ccc|....)\b) and exclude the emails with no matches. For the ones that have at least one match, you do a thorough search.
One thing you can easily do is match against all the words in one go by building an expression like:
\b(?:word1|word2|word3|....)\b
Then you can precompile the pattern and reuse it to look up all occurrences for each email (not sure how you do this with the .NET API, but there must be a way).
Another thing is instead of using the ignorecase flag, if you convert everything to lowercase, that might give you a small speed boost (need to profile it as it's implementation dependent). Don't forget to warm up the CLR when you profile.
This may be faster. You can leverage Regex Groups like this:
public List<string> Keyword_Search(HtmlNode nSearch)
{
var wordFound = new List<string>();
// cache inner HTML
string innerHtml = nSearch.InnerHtml;
string pattern = "(\\b" + string.Join("\\b)|(\\b", _keywordList) + "\\b)";
Regex myRegex = new Regex(pattern, RegexOptions.IgnoreCase);
MatchCollection myMatches = myRegex.Matches(innerHtml);
foreach (Match myMatch in myMatches)
{
// Group 0 represents the entire match so we skip that one
for (int i = 1; i < myMatch.Groups.Count; i++)
{
if (myMatch.Groups[i].Success)
wordFound.Add(_keywordList[i-1]);
}
}
return wordFound;
}
This way you're only using one regular expression. And the indices of the Groups should correlate with your _keywordList by an offset of 1, hence the line wordFound.Add(_keywordList[i-1]);
UPDATE:
After looking at my code again I just realized that putting the matches into Groups is really unnecessary, and Regex Groups have some overhead. Instead, you could remove the capturing parentheses from the pattern and simply add the matches themselves to the wordFound list. This would produce the same effect, but it'd be faster.
It'd be something like this:
public List<string> Keyword_Search(HtmlNode nSearch)
{
var wordFound = new List<string>();
// cache inner HTML
string innerHtml = nSearch.InnerHtml;
string pattern = "\\b(?:" + string.Join("|", _keywordList) + ")\\b";
Regex myRegex = new Regex(pattern, RegexOptions.IgnoreCase);
MatchCollection myMatches = myRegex.Matches(innerHtml);
foreach (Match myMatch in myMatches)
{
wordFound.Add(myMatch.Value);
}
return wordFound;
}
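Several answers above suggest building the pattern once and reusing it for every email. A minimal sketch of that idea follows, working on a plain string instead of an HtmlNode so it stays self-contained. The class name KeywordMatcher is hypothetical, and the code assumes a non-empty keyword list whose entries begin and end with word characters (otherwise \b does not behave as intended).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public sealed class KeywordMatcher
{
    private readonly Regex _regex;

    // Build the alternation once; the same instance is reused for every body.
    public KeywordMatcher(IEnumerable<string> keywords)
    {
        // Regex.Escape keeps keywords containing '.', '+' etc. literal.
        string alternation = string.Join("|", keywords.Select(k => Regex.Escape(k)));
        _regex = new Regex(@"\b(?:" + alternation + @")\b",
            RegexOptions.IgnoreCase | RegexOptions.Compiled);
    }

    // Returns each keyword found in the body once (case-insensitive),
    // in the casing of its first occurrence in the body.
    public List<string> FindKeywords(string body)
    {
        var found = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        foreach (Match match in _regex.Matches(body))
        {
            found.Add(match.Value);
        }
        return found.ToList();
    }
}
```

The matcher would be created once for each keyword list and then called for each of the 100k bodies. Whether RegexOptions.Compiled pays off depends on the workload, so it is worth profiling with and without it.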
Regular expressions can be optimized quite a bit when you just want to match against a fixed set of constant strings. Instead of several matches, e.g. against "winter", "win" or "wombat", you can just match against "w(in(ter)?|ombat)", for example (Jeffrey Friedl's book can give you lots of ideas like this). This kind of optimisation is also built into some programs, notably emacs ('regexp-opt'). I'm not too familiar with .NET, but I assume someone has programmed similar functionality - google for "regexp optimization".
If the regular expression is indeed the bottleneck, and even optimizing it (by concatenating the search words into one expression) doesn’t help, consider using a multi-pattern search algorithm, such as Wu-Manber.
I’ve posted a very simple implementation here on Stack Overflow. It’s written in C++ but since the code is straightforward it should be easy to translate it to C#.
Notice that this will find words anywhere, not just at word boundaries. However, this can be easily tested after you’ve checked whether the text contains any words; either once again with a regular expression (now you only test individual emails – much faster) or manually by checking the characters before and after the individual hits.
If your problem is about searching for Outlook items containing a certain string, you should gain from using Outlook's search facilities...
see:
http://msdn.microsoft.com/en-us/library/bb644806.aspx
If your keyword search uses straight literals, i.e. the keywords contain no further regex patterns, then another method may be more appropriate. The following code demonstrates one such method. It goes through each email only once, whereas your code went through each email 290 times (twice).
public List<string> FindKeywords(string emailbody, List<string> keywordList)
{
// may want to clean up the input a bit, such as replacing '.' and ',' with a space
// and remove double spaces
string[] emailBodyWords = emailbody.Split(' ');
// Compare case-insensitively (requires using System.Linq); uppercasing only the
// body would miss keywords that are not stored in upper case.
List<string> foundKeywords = keywordList.Intersect(emailBodyWords, StringComparer.OrdinalIgnoreCase).ToList();
return foundKeywords;
}
If you can use .Net 3.5+ and LINQ you could do something like this.
public static class HtmlNodeTools
{
public static IEnumerable<string> MatchedKeywords(
this HtmlNode nSearch,
IEnumerable<string> keywordList)
{
//// as regex
//var innerHtml = nSearch.InnerHtml;
//return keywordList.Where(kw =>
// Regex.IsMatch(innerHtml,
// @"\b" + kw + @"\b",
// RegexOptions.IgnoreCase)
// );
//would be faster if you don't need the pattern matching
var innerHtml = ' ' + nSearch.InnerHtml + ' ';
return keywordList.Where(kw => innerHtml.Contains(kw));
}
}
class Program
{
static void Main(string[] args)
{
var keyworkList = new string[] { "hello", "world", "nomatch" };
var h = new HtmlNode()
{
InnerHtml = "hi there hello other world"
};
var matched = h.MatchedKeywords(keyworkList).ToList();
//hello, world
}
}
... reused regex example ...
public static class HtmlNodeTools
{
public static IEnumerable<string> MatchedKeywords(
this HtmlNode nSearch,
IEnumerable<KeyValuePair<string, Regex>> keywordList)
{
// as regex
var innerHtml = nSearch.InnerHtml;
return from kvp in keywordList
where kvp.Value.IsMatch(innerHtml)
select kvp.Key;
}
}
class Program
{
static void Main(string[] args)
{
var keyworkList = new string[] { "hello", "world", "nomatch" };
var h = new HtmlNode()
{
InnerHtml = "hi there hello other world"
};
var keyworkSet = keyworkList.Select(kw =>
new KeyValuePair<string, Regex>(kw,
new Regex(
@"\b" + kw + @"\b",
RegexOptions.IgnoreCase)
)
).ToArray();
var matched = h.MatchedKeywords(keyworkSet).ToList();
//hello, world
}
}

Google-like search query tokenization & string splitting

I'm looking to tokenize a search query similar to how Google does it. For instance, if I have the following search query:
the quick "brown fox" jumps over the "lazy dog"
I would like to have a string array with the following tokens:
the
quick
brown fox
jumps
over
the
lazy dog
As you can see, the tokens preserve the spaces with in double quotes.
I'm looking for some examples of how I could do this in C#, preferably not using regular expressions, however if that makes the most sense and would be the most performant, then so be it.
Also I would like to know how I could extend this to handle other special characters, for example, putting a - in front of a term to force exclusion from a search query and so on.
So far, this looks like a good candidate for regular expressions. If it gets significantly more complicated, a more complex tokenizing scheme may be necessary, but you should avoid that route unless necessary, as it is significantly more work. (On the other hand, for complex schemas, regex quickly turns into a dog and should likewise be avoided.)
This regex should solve your problem:
("[^"]+"|\w+)\s*
Here is a C# example of its usage:
string data = "the quick \"brown fox\" jumps over the \"lazy dog\"";
string pattern = @"(""[^""]+""|\w+)\s*";
MatchCollection mc = Regex.Matches(data, pattern);
foreach(Match m in mc)
{
string group = m.Groups[0].Value;
}
The real benefit of this method is that it can be easily extended to include your "-" requirement like so:
string data = "the quick \"brown fox\" jumps over " +
"the \"lazy dog\" -\"lazy cat\" -energetic";
string pattern = @"(-""[^""]+""|""[^""]+""|-\w+|\w+)\s*";
MatchCollection mc = Regex.Matches(data, pattern);
foreach(Match m in mc)
{
string group = m.Groups[0].Value;
}
Now I hate reading Regex's as much as the next guy, but if you split it up, this one is quite easy to read:
(
-"[^"]+"
|
"[^"]+"
|
-\w+
|
\w+
)\s*
Explanation
If possible match a minus sign, followed by a " followed by everything until the next "
Otherwise match a " followed by everything until the next "
Otherwise match a - followed by any word characters
Otherwise match as many word characters as you can
Put the result in a group
Swallow up any following space characters
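The matches from the pattern above still carry the quotes, the minus sign and trailing spaces. As a variation, and this is only a sketch that is not part of the original answer, named groups can separate the exclusion flag from the term text. The names QuerySketch, Term, neg and text are invented for the example; it relies on .NET allowing the same group name in both alternatives.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuerySketch
{
    // Hypothetical result type: the term text plus whether it was prefixed with '-'.
    public sealed class Term
    {
        public string Text { get; set; }
        public bool Excluded { get; set; }
    }

    // Optional minus, then either a quoted phrase or a single word.
    // Both alternatives capture into the same named group "text".
    private static readonly Regex TermPattern = new Regex(
        @"(?<neg>-)?(?:""(?<text>[^""]+)""|(?<text>\w+))");

    public static List<Term> Parse(string query)
    {
        var terms = new List<Term>();
        foreach (Match match in TermPattern.Matches(query))
        {
            terms.Add(new Term
            {
                Text = match.Groups["text"].Value,       // quotes already stripped
                Excluded = match.Groups["neg"].Success   // true when '-' was present
            });
        }
        return terms;
    }
}
```

Like the original pattern, this treats a hyphen inside a word (for example well-known) as an exclusion marker for the second half, so real queries may need a stricter rule for where a minus sign counts.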
I was just trying to figure out how to do this a few days ago. I ended up using Microsoft.VisualBasic.FileIO.TextFieldParser which did exactly what I wanted (just set HasFieldsEnclosedInQuotes to true). Sure it looks somewhat odd to have "Microsoft.VisualBasic" in a C# program, but it works, and as far as I can tell it is part of the .NET framework.
To get my string into a stream for the TextFieldParser, I used "new MemoryStream(new ASCIIEncoding().GetBytes(stringvar))". Not sure if this is the best way to do it.
Edit: I don't think this would handle your "-" requirement, so maybe the RegEx solution is better
Go char by char to the string like this: (sort of pseudo code)
array words = {} // empty array
string word = "" // empty word
bool in_quotes = false
for char c in search string:
    if in_quotes:
        if c is '"':
            append word to words
            word = "" // empty word
            in_quotes = false
        else:
            append c to word
    else if c is '"':
        in_quotes = true
    else if c is ' ': // space
        if not empty word:
            append word to words
            word = "" // empty word
    else:
        append c to word
// Rest
if not empty word:
    append word to words
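Translated fairly literally into C#, the pseudocode above could look like the sketch below. The class and method names are made up; the behavior follows the pseudocode, including that an unterminated quote simply runs to the end of the input.

```csharp
using System.Collections.Generic;
using System.Text;

public static class QuoteAwareSplitter
{
    // Splits on spaces, except inside double quotes, where spaces are kept.
    public static List<string> Split(string query)
    {
        var words = new List<string>();
        var word = new StringBuilder();
        bool inQuotes = false;

        foreach (char c in query)
        {
            if (inQuotes)
            {
                if (c == '"')
                {
                    // Closing quote ends the phrase.
                    words.Add(word.ToString());
                    word.Clear();
                    inQuotes = false;
                }
                else
                {
                    word.Append(c);
                }
            }
            else if (c == '"')
            {
                inQuotes = true;
            }
            else if (c == ' ')
            {
                // A space outside quotes ends the current word, if there is one.
                if (word.Length > 0)
                {
                    words.Add(word.ToString());
                    word.Clear();
                }
            }
            else
            {
                word.Append(c);
            }
        }

        // Whatever is left after the loop is the last word.
        if (word.Length > 0)
        {
            words.Add(word.ToString());
        }
        return words;
    }
}
```

For the query in the question this produces the seven tokens listed there, with the quoted phrases kept together and without their quotes.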
I was looking for a Java solution to this problem and came up with one based on @Michael La Voie's regex. Thought I would share it here despite the question being asked for C#. Hope that's okay.
public static final List<String> convertQueryToWords(String q) {
List<String> words = new ArrayList<>();
Pattern pattern = Pattern.compile("(\"[^\"]+\"|\\w+)\\s*");
Matcher matcher = pattern.matcher(q);
while (matcher.find()) {
MatchResult result = matcher.toMatchResult();
if (result != null && result.group() != null) {
if (result.group().contains("\"")) {
words.add(result.group().trim().replaceAll("\"", "").trim());
} else {
words.add(result.group().trim());
}
}
}
return words;
}
