Increasing Regex Efficiency - c#

I have about 100k Outlook mail items with about 500-600 characters per Body. I have a list of 580 keywords; I must search each body for every keyword, then append the words found at the bottom.
I believe I've made most of the function efficient, but it still takes a lot of time. Even 100 emails take about 4 seconds.
I run two functions, one for each keyword list (290 keywords per list).
public List<string> Keyword_Search(HtmlNode nSearch)
{
var wordFound = new List<string>();
foreach (string currWord in _keywordList)
{
bool isMatch = Regex.IsMatch(nSearch.InnerHtml, "\\b" + currWord + "\\b",
RegexOptions.IgnoreCase);
if (isMatch)
{
wordFound.Add(currWord);
}
}
return wordFound;
}
Is there any way I can increase the efficiency of this function?
The other thing that might be slowing it down is that I use HTML Agility Pack to navigate through some nodes and pull out the body (nSearch.InnerHtml). The _keywordList is a List item, and not an array.

I assume that the COM call nSearch.InnerHtml is pretty slow and you repeat the call for every single word that you are checking. You can simply cache the result of the call:
public List<string> Keyword_Search(HtmlNode nSearch)
{
var wordFound = new List<string>();
// cache inner HTML
string innerHtml = nSearch.InnerHtml;
foreach (string currWord in _keywordList)
{
bool isMatch = Regex.IsMatch(innerHtml, "\\b" + currWord + "\\b",
RegexOptions.IgnoreCase);
if (isMatch)
{
wordFound.Add(currWord);
}
}
return wordFound;
}
Another optimization would be the one suggested by Jeff Yates. E.g. by using a single pattern:
string pattern = @"(\b(?:" + string.Join("|", _keywordList) + @")\b)";

I don't think this is a job for regular expressions. You might be better off searching each message word by word and checking each word against your word list. With the approach you have, you're searching each message n times where n is the number of words you want to find - it's no wonder that it takes a while.
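A minimal sketch of that word-by-word idea, assuming every keyword is a single word made of letters, digits or underscores. The KeywordSet class and its FindIn method are names made up for this illustration; they are not part of the question's code.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: builds the keyword set once, then scans each text a single time.
public sealed class KeywordSet
{
    private readonly HashSet<string> _keywords;

    public KeywordSet(IEnumerable<string> keywords)
    {
        // Case-insensitive set, so "Hello" in a mail matches the keyword "hello".
        _keywords = new HashSet<string>(keywords, StringComparer.OrdinalIgnoreCase);
    }

    // Returns each keyword hit once, spelled the way it first appears in the text.
    public List<string> FindIn(string text)
    {
        var found = new List<string>();
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

        // \w+ splits the text into runs of word characters; each run is one set lookup.
        foreach (Match m in Regex.Matches(text, @"\w+"))
        {
            if (_keywords.Contains(m.Value) && seen.Add(m.Value))
            {
                found.Add(m.Value);
            }
        }
        return found;
    }
}
```

Each body is then scanned once, whatever the number of keywords. Note that if you feed it InnerHtml, tag and attribute names are tokenized as words too.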

Most of the time comes from matches that fail, so you want to minimize failures.
If the search keywords are infrequent, you can test for all of them at once (with the regexp \b(aaa|bbb|ccc|....)\b) and exclude the emails with no matches. For those that have at least one match, you then do a thorough search.

One thing you can easily do is match against all the words in one go by building an expression like:
\b(?:word1|word2|word3|....)\b
Then you can precompile the pattern and reuse it to look up all occurrences in each email (not sure how you do this with the .NET API, but there must be a way).
Another thing: instead of using the ignore-case flag, converting everything to lowercase might give you a small speed boost (you need to profile it, as it's implementation dependent). Don't forget to warm up the CLR when you profile.
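In .NET, one way to precompile and reuse such a pattern is to construct the Regex once with RegexOptions.Compiled and keep the instance around. The sketch below assumes the keywords are plain literals that start and end with a word character (otherwise \b will not behave as intended) and that the list is not empty. KeywordMatcher is a made-up name for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical wrapper: one compiled regex for the whole keyword list.
public sealed class KeywordMatcher
{
    private readonly Regex _regex;

    public KeywordMatcher(IEnumerable<string> keywords)
    {
        // Escape each keyword so characters like '.' or '+' are matched literally.
        string alternation = string.Join("|", keywords.Select(Regex.Escape));
        _regex = new Regex(@"\b(?:" + alternation + @")\b",
                           RegexOptions.IgnoreCase | RegexOptions.Compiled);
    }

    // Returns the distinct hits, spelled as they appear in the text.
    public List<string> FindIn(string text)
    {
        return _regex.Matches(text)
                     .Cast<Match>()
                     .Select(m => m.Value)
                     .Distinct(StringComparer.OrdinalIgnoreCase)
                     .ToList();
    }
}
```

Create one KeywordMatcher per keyword list up front and call FindIn for every mail. Compiling costs some time at startup, so whether it pays off for your volume is something to measure.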

This may be faster. You can leverage Regex Groups like this:
public List<string> Keyword_Search(HtmlNode nSearch)
{
var wordFound = new List<string>();
// cache inner HTML
string innerHtml = nSearch.InnerHtml;
string pattern = "(\\b" + string.Join("\\b)|(\\b", _keywordList) + "\\b)";
Regex myRegex = new Regex(pattern, RegexOptions.IgnoreCase);
MatchCollection myMatches = myRegex.Matches(innerHtml);
foreach (Match myMatch in myMatches)
{
// Group 0 represents the entire match so we skip that one
for (int i = 1; i < myMatch.Groups.Count; i++)
{
if (myMatch.Groups[i].Success)
wordFound.Add(_keywordList[i-1]);
}
}
return wordFound;
}
This way you're only using one regular expression. And the indices of the Groups should correlate with your _keywordList by an offset of 1, hence the line wordFound.Add(_keywordList[i-1]);
UPDATE:
After looking at my code again I just realized that putting the matches into Groups is really unnecessary, and Regex Groups have some overhead. Instead, you could remove the capturing parentheses from the pattern and simply add the matches themselves to the wordFound list. This would produce much the same effect (the list then holds the text as it appears in the email rather than the keyword as you wrote it), but it'd be faster.
It'd be something like this:
public List<string> Keyword_Search(HtmlNode nSearch)
{
var wordFound = new List<string>();
// cache inner HTML
string innerHtml = nSearch.InnerHtml;
string pattern = "\\b(?:" + string.Join("|", _keywordList) + ")\\b";
Regex myRegex = new Regex(pattern, RegexOptions.IgnoreCase);
MatchCollection myMatches = myRegex.Matches(innerHtml);
foreach (Match myMatch in myMatches)
{
wordFound.Add(myMatch.Value);
}
return wordFound;
}

Regular expressions can be optimized quite a bit when you just want to match against a fixed set of constant strings. Instead of several matches, e.g. against "winter", "win" or "wombat", you can just match against "w(in(ter)?|ombat)", for example (Jeffrey Friedl's book can give you lots of ideas like this). This kind of optimisation is also built into some programs, notably emacs ('regexp-opt'). I'm not too familiar with .NET, but I assume someone has programmed similar functionality - google for "regexp optimization".
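I don't know of a built-in .NET equivalent of regexp-opt, but the prefix-folding idea itself can be sketched in a few lines. This is only an illustration, assuming the keywords are literals; PrefixPattern.Build is a made-up helper, and whether the folded pattern is actually faster on the .NET engine is something you would have to profile.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class PrefixPattern
{
    // Hypothetical helper: folds literal words that share prefixes into one pattern.
    public static string Build(IEnumerable<string> words)
    {
        List<string> list = words.Distinct().ToList();

        // An empty string means some word ends at this point, so the rest is optional.
        bool endsHere = list.Remove("");
        if (list.Count == 0) return "";

        // Group the words by first character and recurse on the remainders.
        List<string> branches = list
            .GroupBy(w => w[0])
            .Select(g => Regex.Escape(g.Key.ToString())
                         + Build(g.Select(w => w.Substring(1))))
            .ToList();

        string body = (branches.Count == 1 && !endsHere)
            ? branches[0]
            : "(?:" + string.Join("|", branches) + ")";
        return endsHere ? body + "?" : body;
    }
}

// PrefixPattern.Build(new[] { "winter", "win", "wombat" })
// returns "w(?:in(?:ter)?|ombat)"
```

The result can be wrapped as @"\b" + pattern + @"\b" and used with RegexOptions.IgnoreCase like any other pattern.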

If the regular expression is indeed the bottleneck, and even optimizing it (by concatenating the search words to one expression) doesn’t help, consider using a multi-pattern search algorithm, such as Wu-Manber.
I’ve posted a very simple implementation here on Stack Overflow. It’s written in C++ but since the code is straightforward it should be easy to translate it to C#.
Notice that this will find words anywhere, not just at word boundaries. However, this can be easily tested after you’ve checked whether the text contains any words; either once again with a regular expression (now you only test individual emails – much faster) or manually by checking the characters before and after the individual hits.
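The manual check of the characters before and after a hit could look roughly like this. It uses plain IndexOf rather than Wu-Manber to locate the hits, and it treats letters, digits and underscore as word characters, which is an assumption chosen to resemble \b. WholeWord is a made-up class name.

```csharp
using System;

public static class WholeWord
{
    // Hypothetical helper: true if word occurs in text with no word character on either side.
    public static bool ContainsWholeWord(string text, string word)
    {
        if (string.IsNullOrEmpty(word)) return false;

        int index = text.IndexOf(word, StringComparison.OrdinalIgnoreCase);
        while (index >= 0)
        {
            int end = index + word.Length;
            bool startOk = index == 0 || !IsWordChar(text[index - 1]);
            bool endOk = end == text.Length || !IsWordChar(text[end]);
            if (startOk && endOk) return true;

            // Not a whole word here; look for the next occurrence.
            index = text.IndexOf(word, index + 1, StringComparison.OrdinalIgnoreCase);
        }
        return false;
    }

    private static bool IsWordChar(char c)
    {
        return char.IsLetterOrDigit(c) || c == '_';
    }
}
```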

If your problem is about searching for Outlook items containing a certain string, you should gain from using Outlook's search facilities...
see:
http://msdn.microsoft.com/en-us/library/bb644806.aspx

If your keywords are straight literals, i.e. they do not contain further regex patterns, then another method may be more appropriate. The following code demonstrates one such method. It goes through each email only once, whereas your code went through each email 290 times (twice).
public List<string> FindKeywords(string emailbody, List<string> keywordList)
{
// may want to clean up the input a bit, such as replacing '.' and ',' with a space
// and remove double spaces
// compare case-insensitively so "Hello" in the body matches the keyword "hello"
string[] emailBodyWords = emailbody.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
List<string> foundKeywords = keywordList.Intersect(emailBodyWords, StringComparer.OrdinalIgnoreCase).ToList();
return foundKeywords;
}

If you can use .NET 3.5+ and LINQ you could do something like this.
public static class HtmlNodeTools
{
public static IEnumerable<string> MatchedKeywords(
this HtmlNode nSearch,
IEnumerable<string> keywordList)
{
//// as regex
//var innerHtml = nSearch.InnerHtml;
//return keywordList.Where(kw =>
// Regex.IsMatch(innerHtml,
// @"\b" + kw + @"\b",
// RegexOptions.IgnoreCase)
// );
//would be faster if you don't need the pattern matching
var innerHtml = ' ' + nSearch.InnerHtml + ' ';
return keywordList.Where(kw => innerHtml.Contains(kw));
}
}
class Program
{
static void Main(string[] args)
{
var keyworkList = new string[] { "hello", "world", "nomatch" };
var h = new HtmlNode()
{
InnerHtml = "hi there hello other world"
};
var matched = h.MatchedKeywords(keyworkList).ToList();
//hello, world
}
}
... reused regex example ...
public static class HtmlNodeTools
{
public static IEnumerable<string> MatchedKeywords(
this HtmlNode nSearch,
IEnumerable<KeyValuePair<string, Regex>> keywordList)
{
// as regex
var innerHtml = nSearch.InnerHtml;
return from kvp in keywordList
where kvp.Value.IsMatch(innerHtml)
select kvp.Key;
}
}
class Program
{
static void Main(string[] args)
{
var keyworkList = new string[] { "hello", "world", "nomatch" };
var h = new HtmlNode()
{
InnerHtml = "hi there hello other world"
};
var keyworkSet = keyworkList.Select(kw =>
new KeyValuePair<string, Regex>(kw,
new Regex(
@"\b" + kw + @"\b",
RegexOptions.IgnoreCase)
)
).ToArray();
var matched = h.MatchedKeywords(keyworkSet).ToList();
//hello, world
}
}

Related

Using wildcards to get URLs from a string

I'm using Visual Studio C# and want to list the parts of a string that are URLs, using wildcards. I've used regex with Perl for years, but I just cannot figure it out in C#. The string may contain no URLs, or one or more.
str = "This has more than one URL http://findme.com/lost and another one named http://www.amidumb.net but then there is this one https://hello.ua/findme/ifyoucan/ at last."
I want to list findme.com, www.amidumb.net and hello.ua
This is where I am:
string pattern = @"\/\/(.*)\/(.*)";
Regex r = new Regex(pattern, RegexOptions.IgnoreCase);
Match m = r.Match(newLineItem);
ArrayList results = new ArrayList();
while(m.Success) {
Console.WriteLine(m + "\n");
m = m.NextMatch();
}
When I run the above, it prints the whole line after the first instance of ":" starting with the first occurrence of "//".
It seems like I should be able to select the first (.*) after // and each one after that.
I'm guessing I somehow need to add each found instance to a list, but I am totally lost. Am I even headed in the right direction?
Here is a simple example that I believe gets what you are looking for.
void Main()
{
var rawString = "This has more than one URL http://findme.com/lost and another one named http://www.amidumb.net but then there is this one https://hello.ua/findme/ifyoucan/ at last.";
var urlList = UrlMaker(rawString);
}
// You can define other methods, fields, classes and namespaces here
public List<string> UrlMaker(string input)
{
List<string> urls = new List<string>();
var linkParser = new Regex(@"\b(?:https?://|www\.)\S+\b", RegexOptions.Compiled | RegexOptions.IgnoreCase);
var rawString = input;
foreach (Match m in linkParser.Matches(rawString))
{
if (m.Value.Contains("http"))
{
Uri url = new Uri(m.Value);
urls.Add(url.Host);
}
else
{
urls.Add(m.Value);
}
}
return urls;
}
For the sample string, urlList ends up containing findme.com, www.amidumb.net and hello.ua.

Subtitle's Time Editor with Regular Expressions

I have a subtitle in my string
string subtitle = Encoding.ASCII.GetString(srt_text);
srt_text is a byte array, which I convert to a string as you can see. The subtitle starts and finishes like this:
Starts:
1
00:00:40,152 --> 00:00:43,614
Out west there was this fella,
2
00:00:43,697 --> 00:00:45,824
fella I want to tell you about,
Finish:
1631
01:52:17,016 --> 01:52:20,019
Catch ya later on
down the trail.
1632
01:52:20,102 --> 01:52:24,440
Say, friend, you got any more
of that good Sarsaparilla?
Now I want to extract the times and put them into an array. I tried
Regex rgx = new Regex(@"^(?:[01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9],[0-9][0-9][0-9]$", RegexOptions.IgnoreCase);
Match m = rgx.Match(subtitle);
I think this can find the times, but I haven't managed to put them into an array.
Assume 'times' is my string array. I want the array to look like this:
times[0] = "00:00:40,152"
times[1] = "00:00:43,614"
...
times[n-1] = "01:52:20,102"
times[n] = "01:52:24,440"
It has to keep going until the subtitle is finished, so that all the times are included.
I am open to your advice. How can I do this? I am new, so I have probably made a lot of mistakes; I apologize. I hope you can understand and help me.
Using Regular Expressions
You can do this with multiple matches using Regex.Matches.
The regex used is
(\d{2}:\d{2}:\d{2},\d+)
\d matches a digit
{2} is the repetition count
+ means one or more repetitions
: and , are plain characters with no special meaning.
Here is the syntax.
var matchList = Regex.Matches(subtitle, @"(\d{2}:\d{2}:\d{2},\d+)",RegexOptions.Multiline);
var times = matchList.Cast<Match>().Select(match => match.Value).ToList();
With this your times variable will be filled with all the time substrings.
Also note: The RegexOptions.Multiline part is optional in this scenario.
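If you later want to shift or compare the times rather than just list them, the matched strings can be turned into TimeSpan values. A minimal sketch, assuming every timestamp has the exact form hh:mm:ss,fff with hours below 24 (anything else would make ParseExact throw). SrtTimes is a made-up class name.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class SrtTimes
{
    // Hypothetical helper: extracts every timestamp from the subtitle text, in order.
    public static List<TimeSpan> Parse(string subtitle)
    {
        var times = new List<TimeSpan>();
        foreach (Match m in Regex.Matches(subtitle, @"\d{2}:\d{2}:\d{2},\d{3}"))
        {
            // In a custom TimeSpan format, literal ':' and ',' must be escaped with a backslash.
            times.Add(TimeSpan.ParseExact(m.Value, @"hh\:mm\:ss\,fff",
                                          CultureInfo.InvariantCulture));
        }
        return times;
    }
}
```

Adding an offset is then just times[i] + TimeSpan.FromSeconds(2), and ToString(@"hh\:mm\:ss\,fff") writes the value back in the same format.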
This might help you get the times from the string you have.
string subtitle = @"1
00:00:40,152 --> 00:00:43,614
Out west there was this fella,
2
00:00:43,697 --> 00:00:45,824
fella I want to tell you about,";
List<string> timestrings = new List<string>();
List<string> splittedtimestrings = new List<string>();
List<string> splittedstring = subtitle.Split(new string[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries ).ToList();
foreach(string st in splittedstring)
{
// a timing line is one that contains the arrow separator (not every time contains "00")
if(st.Contains(" --> "))
{
timestrings.Add(st);
}
}
foreach(string s in timestrings)
{
string[] foundstr = s.Split(new string[] { " --> " }, StringSplitOptions.RemoveEmptyEntries);
splittedtimestrings.Add(foundstr[0]);
splittedtimestrings.Add(foundstr[1]);
}
I have tried splitting the string to get the time strings instead of using Regex, because I think Regex should be used to process text based on pattern matches rather than for comparing and matching literal text.

Highlight search terms in a string

I have written a function which works with my site's search functionality. When the user searches for a word, I perform a replace on the returned search content to take any word that the user entered into the search and wrap it in span tags with a custom class, which basically bolds the word on the page. After overcoming my first roadblock of having to incorporate case-insensitive replacements, I'm now stuck in another predicament. The word on the page is being replaced with the user's casing, which looks funny because the content returned is a lot of legal text and acronyms. If a user were to search "rpC 178", the "RPC 178" in the content is displayed in bold but in the user's case, "rpC 178". My first thought was to split the content by "space" and keep a temporary copy of the word before it's replaced in order to preserve its current case, but some of these content blocks can be upwards of 4000 words, so that seems inefficient. Am I going about this the wrong way?
Here is my current code:
public static String HighlightWords(String content, String className, String searchTerms)
{
string[] terms = new string[] { };
if (!string.IsNullOrWhiteSpace(searchTerms))
{
terms = searchTerms.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
}
if (terms == null || terms.Length == 0)
{
return content;
}
var optimizedContent = new StringBuilder(content);
var startHtml = string.Format("<span class=\"{0}\">", className);
var endHtml = "</span>";
string result = string.Empty;
foreach (var term in terms)
{
result = Regex.Replace(optimizedContent.ToString(), term, string.Format("{0}" + term + "{1}", startHtml, endHtml), RegexOptions.Compiled | RegexOptions.IgnoreCase);
}
return result;
}
You can use the other overload of the Regex.Replace method that accepts a MatchEvaluator delegate. Here you pass a method that gets the actual text found as a parameter and can dynamically build the string to use as a replacement.
Sample:
string output = Regex.Replace(input, term,
match => startHtml + match.Value + endHtml,
RegexOptions.Compiled | RegexOptions.IgnoreCase);
Note that the notation with the => symbol (a lambda expression) does not work in versions of C# older than 3.0. In that case you have to use the longer form with an anonymous method:
string output = Regex.Replace(input, term, new MatchEvaluator(delegate(Match match)
{
return startHtml + match.Value + endHtml;
}),
RegexOptions.Compiled | RegexOptions.IgnoreCase);
You can also improve your code, because you do not need a foreach loop over all the specified search terms. Just build a regular expression that contains all the terms to look for and then use that for searching.
Remember to use Regex.Escape() to escape the data entered by the user before using it for searching with the Regex class, so that everything works as expected when the user enters characters that have a special meaning in regular expressions.
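Putting these pieces together, a single-pass version might look like the sketch below. It keeps the question's HighlightWords signature, while the Highlighter class name and the ordering of terms by length are additions of mine. It assumes the content is plain text, so it makes no attempt to avoid replacing inside HTML tags.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class Highlighter
{
    // Sketch: wraps every occurrence of any search term in a span, keeping the content's own casing.
    public static string HighlightWords(string content, string className, string searchTerms)
    {
        if (string.IsNullOrWhiteSpace(searchTerms)) return content;

        string[] terms = searchTerms.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        // Escape user input, and try longer terms first so "rpc178" wins over "rpc".
        string pattern = string.Join("|",
            terms.OrderByDescending(t => t.Length).Select(Regex.Escape));

        string startHtml = "<span class=\"" + className + "\">";
        const string endHtml = "</span>";

        // match.Value is the text as it appears in content, so its casing is preserved.
        return Regex.Replace(content, pattern,
            match => startHtml + match.Value + endHtml,
            RegexOptions.IgnoreCase);
    }
}
```

With content "See RPC 178 now" and search terms "rpC 178", both "RPC" and "178" get wrapped and "RPC" stays uppercase.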

Get only Whole Words from a .Contains() statement

I've used .Contains() to find out whether a sentence contains a specific word, but I found something weird:
I wanted to find out whether the word "hi" was present in each of the following sentences:
The child wanted to play in the mud
Hi there
Hector had a hip problem
if(sentence.Contains("hi"))
{
//
}
I only want the SECOND sentence to be picked out, but all 3 are, since "child" has a 'hi' in it and "hip" has a 'hi' in it. How do I use .Contains() such that only whole words get picked out?
Try using Regex:
if (Regex.Match(sentence, @"\bhi\b", RegexOptions.IgnoreCase).Success)
{
//
}
This works just fine for me on your input text.
Here's a Regex solution:
Regex has a Word Boundary Anchor using \b
Also, if the search string might come from user input, you might consider escaping the string using Regex.Escape
This example should filter a list of strings the way you want.
string findme = "hi";
string pattern = @"\b" + Regex.Escape(findme) + @"\b";
Regex re = new Regex(pattern,RegexOptions.IgnoreCase);
List<string> data = new List<string> {
"The child wanted to play in the mud",
"Hi there",
"Hector had a hip problem"
};
var filtered = data.Where(d => re.IsMatch(d));
You could split your sentence into words - you could split at each space and then trim any punctuation. Then check if any of these words are 'hi':
var punctuation = sentence.Where(Char.IsPunctuation).Distinct().ToArray();
var words = sentence.Split().Select(x => x.Trim(punctuation));
var containsHi = words.Contains("hi", StringComparer.OrdinalIgnoreCase);
See a working demo here: https://dotnetfiddle.net/AomXWx
You could write your own extension method for string like:
static class StringExtension
{
public static bool ContainsWord(this string s, string word)
{
string[] ar = s.Split(' ');
foreach (string str in ar)
{
if (string.Equals(str, word, StringComparison.OrdinalIgnoreCase))
return true;
}
return false;
}
}

Find all substrings between two strings

I need to get all the substrings of a string that lie between two given strings.
For example:
StringParser.GetSubstrings("[start]aaaaaa[end] wwwww [start]cccccc[end]", "[start]", "[end]");
should return the 2 strings "aaaaaa" and "cccccc".
Suppose we have only one level of nesting.
I'm not sure about regexp, but I think it will be useful.
private IEnumerable<string> GetSubStrings(string input, string start, string end)
{
Regex r = new Regex(Regex.Escape(start) + "(.*?)" + Regex.Escape(end));
MatchCollection matches = r.Matches(input);
foreach (Match match in matches)
yield return match.Groups[1].Value;
}
Here's a solution that doesn't use regular expressions and doesn't take nesting into consideration.
public static IEnumerable<string> EnclosedStrings(
this string s,
string begin,
string end)
{
int beginPos = s.IndexOf(begin, 0);
while (beginPos >= 0)
{
int start = beginPos + begin.Length;
int stop = s.IndexOf(end, start);
if (stop < 0)
yield break;
yield return s.Substring(start, stop - start);
beginPos = s.IndexOf(begin, stop+end.Length);
}
}
You can use a regular expression, but remember to call Regex.Escape on your arguments:
public static IEnumerable<string> GetSubStrings(
string text,
string start,
string end)
{
string regex = string.Format("{0}(.*?){1}",
Regex.Escape(start),
Regex.Escape(end));
return Regex.Matches(text, regex, RegexOptions.Singleline)
.Cast<Match>()
.Select(match => match.Groups[1].Value);
}
I also added the SingleLine option so that it will match even if there are new-lines in your text.
You're going to need to better define the rules that govern your matching needs. When building any kind of matching or search code you need to be very clear about what inputs you anticipate and what outputs you need to produce. It's very easy to produce buggy code if you don't take these questions into close consideration. That said...
You should be able to use regular expressions. Nesting may make it slightly more complicated but still doable (depending on what you expect to match in nested scenarios). Something like this should get you started:
var start = "[start]";
var end = "[end]";
var regEx = new Regex(String.Format("{0}(.*){1}", Regex.Escape(start), Regex.Escape(end)));
var source = "[start]aaaaaa[end] wwwww [start]cccccc[end]";
var matches = regEx.Match( source );
It should be trivial to wrap the code above into a function appropriate for your needs.
I was bored, and thus I made a useless micro benchmark which "proves" (on my dataset, which has strings up to 7k of characters and <b> tags for start/end parameters) my suspicion that juharr's solution is the fastest of the three overall.
Results (1000000 iterations * 20 test cases):
juharr: 6371ms
Jake: 6825ms
Mark Byers: 82063ms
NOTE: Compiled regex didn't speed things up much on my dataset.
