RegEx for a Glossary Function - c#

I'm working on a web-based help system that will auto-insert links into the explanatory text, taking users to other topics in help. I have hundreds of terms that should be linked, i.e.
"Manuals and labels" (describes these concepts in general)
"Delete Manuals and Labels" (describes this specific action)
"Learn more about adding manuals and labels" (again, more specific action)
I have a RegEx to find / replace whole words (good ol' \b), which works great, except for linked terms found inside other linked terms. Instead of:
<a href="#">Learn more about manuals and labels</a>
I end up with
<a href="#">Learn more about <a href="#">manuals and labels</a></a>
Which makes everyone cry a little. Changing the order in which the terms are replaced (going shortest to longest) means that I'd get:
Learn more about <a href="#">manuals and labels</a>
Without the outer link I really need.
The further complication is that the capitalization of the search terms can vary, and I need to retain the original capitalization. If I could do something like this, I'd be all set:
Regex _regex = new Regex("\\b" + termToFind + "(|s)" + "\\b", RegexOptions.IgnoreCase);
string resultingText = _regex.Replace(textThatNeedsLinksInserted, "<a>" + "$&".Replace(" ", "_") + "</a>");
And then after all the terms are done, remove the "_", that would be perfect. "Learn_more_about_manuals_and_labels" wouldn't match "manuals and labels," and all is well.
It would be hard to have the help authors delimit the terms that need to be replaced when writing the text -- they're not used to coding. Also, this would limit the flexibility to add new terms later, since we'd have to go back and add delimiters to all the previously written text.
Is there a RegEx that would let me replace whitespace with "_" in the original match? Or is there a different solution that's eluding me?

From your examples with nested links it sounds like you're making individual passes over the terms and performing multiple Regex.Replace calls. Since you're using a regex you should let it do the heavy lifting and put a nice pattern together that makes use of alternation.
In other words, you likely want a pattern like this: \b(term1|term2|termN)\b
var input = "Having trouble with your manuals and labels? Learn more about adding manuals and labels. Need to get rid of them? Try to delete manuals and labels.";
var terms = new[]
{
"Learn more about adding manuals and labels",
"Delete Manuals and Labels",
"manuals and labels"
};
var pattern = @"\b(" + String.Join("|", terms) + @")\b";
var replacement = @"<a href=""#"">$1</a>";
var result = Regex.Replace(input, pattern, replacement, RegexOptions.IgnoreCase);
Console.WriteLine(result);
Now, to address the issue of a corresponding href value for each term, you can use a dictionary and change the regex to use a MatchEvaluator that will return the custom format and look up the value from the dictionary. The dictionary also ignores case by passing in StringComparer.OrdinalIgnoreCase. I tweaked the pattern slightly by adding ?: at the start of the group to make it a non-capturing group since I am no longer referring to the captured item as I did in the first example.
var terms = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
{
{ "Learn more about adding manuals and labels", "2.html" },
{ "Delete Manuals and Labels", "3.html" },
{ "manuals and labels", "1.html" }
};
var pattern = @"\b(?:" + String.Join("|", terms.Select(t => t.Key)) + @")\b";
var result = Regex.Replace(input, pattern,
m => String.Format(@"<a href=""{0}"">{1}</a>", terms[m.Value], m.Value),
RegexOptions.IgnoreCase);
Console.WriteLine(result);
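One caveat with the alternation approach above: .NET tries the alternatives from left to right, so a shorter term listed before a longer one that starts at the same position would win, and a term containing regex metacharacters would corrupt the pattern. The following is a minimal sketch, not part of the original answer, that sorts the keys longest first and escapes them. The helper name `InsertLinks` and the sample URLs are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: links every glossary term in a single pass.
static string InsertLinks(string text, IDictionary<string, string> terms)
{
    // Longest keys first, so a longer term beats a shorter one that starts
    // at the same position; Regex.Escape guards against metacharacters.
    var alternatives = terms.Keys
        .OrderByDescending(k => k.Length)
        .Select(k => Regex.Escape(k));
    var pattern = @"\b(?:" + string.Join("|", alternatives) + @")\b";

    // m.Value keeps the capitalization found in the text. The dictionary
    // must be case-insensitive for the terms[m.Value] lookup to succeed.
    return Regex.Replace(text, pattern,
        m => string.Format("<a href=\"{0}\">{1}</a>", terms[m.Value], m.Value),
        RegexOptions.IgnoreCase);
}

var glossary = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
{
    { "manuals and labels", "1.html" },
    { "Delete Manuals and Labels", "3.html" }
};
Console.WriteLine(InsertLinks("Try to delete manuals and labels.", glossary));
```

Because everything happens in one pass, text that has already been wrapped in a link is never scanned again, which is what avoids the nested links.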

I would use an ordered dictionary like this, making sure the smallest term is last:
using System;
using System.Text.RegularExpressions;
using System.Collections.Specialized;
public class Test
{
public static void Main()
{
OrderedDictionary Links = new OrderedDictionary();
Links.Add("Learn more about adding manuals and labels", "2");
Links.Add("Delete Manuals and Labels", "3");
Links.Add("manuals and labels", "1");
string text = "Having trouble with your manuals and labels? Learn more about adding manuals and labels. Need to get rid of them? Try to delete manuals and labels.";
foreach (string termToFind in Links.Keys)
{
Regex _regex = new Regex(@"\b" + termToFind + @"s?\b(?![^<>]*</)", RegexOptions.IgnoreCase);
text = _regex.Replace(text, @"<a href=""" + Links[termToFind] + @""">$&</a>");
}
Console.WriteLine(text);
}
}
The negative lookahead (?![^<>]*</) that I added prevents a term from being replaced again when it already sits between anchor tags from an earlier pass.
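To see what the lookahead does in isolation, here is a small sketch (the sample strings are made up for illustration). The lookahead rejects a match when the text that follows reaches a closing tag without first crossing another `<` or `>`, which is the case for text inside an element.

```csharp
using System;
using System.Text.RegularExpressions;

// Same guard as in the loop above, applied to a single fixed term.
var guarded = new Regex(@"\bmanuals and labels\b(?![^<>]*</)", RegexOptions.IgnoreCase);

// The term is followed directly by "</a>", so the lookahead rejects it.
string alreadyLinked = "<a href=\"2\">Learn more about adding manuals and labels</a>";
// No closing tag follows, so the term is free to be linked.
string plain = "Having trouble with your manuals and labels?";

Console.WriteLine(guarded.IsMatch(alreadyLinked)); // False
Console.WriteLine(guarded.IsMatch(plain));         // True
```

Note that the guard cannot tell anchors from other elements: a term sitting directly inside, say, `<p>...</p>` is followed by `</p>` and would be skipped as well.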

First, you can prevent your regex for manuals and labels from matching inside Learn more about manuals and labels by using a negative lookbehind. Modified, your regex looks like this:
(?<!Learn more about )(manuals and labels)
But for your specific request I would suggest a different solution. You should define a rule or a priority list for your regexes, or both. A possible rule could be "always search first for the regex that matches the most characters". This, however, requires that your regexes always match a fixed length. It also does not prevent one regex from consuming and replacing characters that would have been matched by a different regex (maybe even one of the same size).
Of course, you will need to add an additional lookbehind and lookahead to each of your regexes to prevent replacing strings that are inside your replacement elements.

Related

Filtering HTML document with 1-10k keywords

I have an HTML document and want to filter it for occurrences of multiple keywords (1k at the moment, later up to 10k).
I have a precompiled regex which stores my search terms like:
static Regex r = new Regex(@"keyword1|keyword2|keyword999", RegexOptions.Compiled | RegexOptions.IgnoreCase);
This is my code:
Stopwatch sw = new Stopwatch();
sw.Start();
MatchCollection matches = Cache.r.Matches(doc.DocumentNode.InnerHtml);
string s = "";
if (matches.Count > 0)
{
foreach (Match m in matches)
{
s += m.Value + ",";
}
}
long time = sw.ElapsedMilliseconds;
Console.Write(time + " = "+matches.Count+" -> "+s );
The average run takes about 5-8 seconds, which is way too much.
Is there an efficient way to filter an HTML document against a lot of keywords?
Or maybe there are more efficient algorithms for this kind of filtering.
As lboshuizen pointed out
Creating a regex with 10k keywords seems not the way to go [...]
If you can afford spawning multiple threads, you can scan the document in parallel for occurrences of keywords:
IEnumerable<string> keywords = LoadKeywords();
List<string> list = new List<string>();
keywords.AsParallel()
.Aggregate(list, (seed, keyword) =>
{
if(doc.DocumentNode.InnerHtml.Contains(keyword))
seed.Add(keyword);
return seed;
});
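A variation on the snippet above, offered as a sketch rather than a drop-in replacement: letting PLINQ collect the results through `Where` and `ToList` avoids mutating one shared list from the aggregation lambda and keeps the result instead of discarding it. The helper name `FindKeywords` is hypothetical, and the HTML is assumed to have been read into a string once, up front.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: returns the keywords that occur in the text.
static List<string> FindKeywords(string html, IEnumerable<string> keywords)
{
    // Each keyword is tested independently, so the work parallelizes
    // without any shared mutable state.
    return keywords
        .AsParallel()
        .Where(k => html.IndexOf(k, StringComparison.OrdinalIgnoreCase) >= 0)
        .ToList();
}

var found = FindKeywords("<p>Regex and PLINQ</p>", new[] { "regex", "linq", "python" });
// PLINQ does not guarantee result order, so sort before printing.
Console.WriteLine(string.Join(",", found.OrderBy(k => k, StringComparer.Ordinal)));
```

This is a plain substring test, so "linq" is reported as present inside "PLINQ"; whether that counts as a hit depends on what the keywords are meant to represent.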
You should use StringBuilder instead of string concatenation.
Unless you tell us more about what the keywords are, there is hardly any room for optimization.
Some of the answers are already pretty good, but I figured I'd throw this in as well ...
I've done the same thing and I used the HTML Agility Pack to help cut down on what I was analyzing for keywords.
http://htmlagilitypack.codeplex.com/
It's very easy to take an HTML fragment, search only for textual nodes and then run your keyword analysis over that space instead of the entire document.
Also it helps get rid of false positives (keywords appearing in javascript comments, alt tags, whatever else).
Just an idea to try and trim down your search space.
Suggestion:
Creating a regex with 10k keywords seems not the way to go from my POV. A regex is greedy and will try all kinds of redundant matches, which wastes time.
Build regexes with smaller keyword sets and run them incrementally over your HTML document.
As an optimization, remove the matched keywords (and related content) from the document; the document will shrink, so the remaining regexes have much less to do and run faster.
Or
Turn it around: don't use a regex to scan the document.
Break the document down into words and check them against a dictionary. I doubt that the document will contain all 10k words. (Looping over the smaller set is more efficient than looping over the larger one.)
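The "turn it around" idea can be sketched as follows, assuming every keyword is a single word (multi-word phrases would need a different tokenization). The helper name `MatchWords` is hypothetical. Each word of the document costs one hash lookup, regardless of how many keywords there are.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the distinct words of the text that are keywords.
static List<string> MatchWords(string text, IEnumerable<string> keywords)
{
    // Case-insensitive set, mirroring RegexOptions.IgnoreCase in the question.
    var dictionary = new HashSet<string>(keywords, StringComparer.OrdinalIgnoreCase);

    return Regex.Matches(text, @"\w+")
        .Cast<Match>()
        .Select(m => m.Value)
        .Where(word => dictionary.Contains(word))
        .Distinct(StringComparer.OrdinalIgnoreCase)
        .ToList();
}

var hits = MatchWords("The quick brown Fox; the fox", new[] { "fox", "the", "cat" });
Console.WriteLine(hits.Count); // 2
```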

Can anyone recommend a method to perform the following string operation using C#

Suppose I have a string:
"my event happened in New York on Broadway in 1976"
I have many such strings, but the locations and dates vary. For example:
"my event happened in Boston on 2nd Street in 1998"
"my event happened in Ann Arbor on Washtenaw in 1968"
so the general form is:
"my event happened in X on Y in Z"
I would like to parse the string to extract X, Y and Z
I could use Split and use the sentinel words "in", "on" to delimit the token I want but this seems clunky. But using a full parser/lexer like grammatica seems heavyweight.
Recommendations would be gratefully accepted.
Is there a "simple" parser lexer for C#?
KISS applies here. Just do the String.Split solution, or use String.IndexOf to find the "in" and "on" (frankly, String.Split is the simplest). You don't need anything more complicated for such a simple "grammar"; note in particular that regex is overkill here.
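A minimal sketch of the IndexOf variant, assuming the sentence always follows the "... in X on Y in Z" form from the question. The helper name `TryParseEvent` is made up for illustration.

```csharp
using System;

// Hypothetical helper: returns false when the sentinel words are missing.
static bool TryParseEvent(string sentence, out string city, out string street, out string year)
{
    city = street = year = string.Empty;

    // First " in " starts X, the next " on " starts Y, the last " in " starts Z.
    int firstIn = sentence.IndexOf(" in ", StringComparison.Ordinal);
    if (firstIn < 0) return false;
    int on = sentence.IndexOf(" on ", firstIn + 4, StringComparison.Ordinal);
    if (on < 0) return false;
    int lastIn = sentence.LastIndexOf(" in ", StringComparison.Ordinal);
    if (lastIn < on + 4) return false;

    city = sentence.Substring(firstIn + 4, on - (firstIn + 4));
    street = sentence.Substring(on + 4, lastIn - (on + 4));
    year = sentence.Substring(lastIn + 4);
    return true;
}

if (TryParseEvent("my event happened in New York on Broadway in 1976",
                  out var city, out var street, out var year))
{
    Console.WriteLine(city + " | " + street + " | " + year); // New York | Broadway | 1976
}
```

The sentinels are ambiguous by nature: a city or street name that itself contains " on " or " in " would be split in the wrong place, which is where validating against known city names would help.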
Try using regex pattern matching. Here's an MSDN link that should be pretty helpful:
http://support.microsoft.com/kb/308252
An example might help. Note that a regex solution gives you scope to accept more variants as and when you see them. I reject the idea that RegEx is overkill, by the way. I'm no expert but it's so easy to do stuff like this I do wonder why it's not used more frequently.
var regEx = new Regex(
"(?<intro>.+) in (?<city>.+) on (?<locality>.+) in (?<eventDate>.+)"
);
var match = regEx.Match("My event happens in Baltimore on Main Street in 1876.");
if (!match.Success) return;
foreach (var group in new[] {"intro", "city", "locality", "eventDate"})
{
Console.WriteLine(group + ":" + match.Groups[group]);
}
Finally, if performance is a real worry (though ignore this if it isn't), look here for optimisation tips.
If you are sure that the string will always be in that format, you can do what you have already figured out: split on the word "in" and then on "on".
To be sure, you could then look up the extracted words in a database of city names and years to validate the result.
If the string may not always be in that format, you can instead search the whole string for words, match them against a database of city names and years, and check them for validity.

Using .NET RegEx to retrieve part of a string after the second '-'

This is my first stack message. Hope you can help.
I have several strings I need to break up for use later. Here are a couple of examples of what I mean:
fred-064528-NEEDED
frederic-84728957-NEEDED
sam-028-NEEDED
As you can see above, the string lengths vary greatly, so I believe regex is the only way to achieve what I want. What I need is the rest of the string after the second hyphen ('-').
I am very weak at regex, so any help would be great.
Thanks in advance.
Just to offer an alternative without using regex:
foreach(string s in list)
{
int x = s.LastIndexOf('-');
string sub = s.Substring(x + 1);
}
Add validation to taste.
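LastIndexOf returns the text after the last hyphen, which is the same as "after the second hyphen" only when the string contains exactly two. If the wanted part may itself contain hyphens, a sketch along these lines locates the second hyphen explicitly. The helper name `AfterSecondHyphen` is hypothetical, and returning an empty string for malformed input is an arbitrary choice.

```csharp
using System;

// Hypothetical helper: everything after the second '-', or "" if there is none.
static string AfterSecondHyphen(string s)
{
    int first = s.IndexOf('-');
    if (first < 0) return string.Empty;

    // Continue searching just past the first hyphen.
    int second = s.IndexOf('-', first + 1);
    return second < 0 ? string.Empty : s.Substring(second + 1);
}

Console.WriteLine(AfterSecondHyphen("fred-064528-NEEDED"));        // NEEDED
Console.WriteLine(AfterSecondHyphen("frederic-84728957-NEE-DED")); // NEE-DED
```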
Something like this. It will take anything (except line breaks) after the second '-', including any further '-' signs.
var exp = @"^\w*-\w*-(.*)$";
var match = Regex.Match("frederic-84728957-NEE-DED", exp);
if (match.Success)
{
var result = match.Groups[1]; //Result is NEE-DED
Console.WriteLine(result);
}
EDIT: I answered another question which relates to this. Except, it asked for a LINQ solution and my answer was the following which I find pretty clear.
Pimp my LINQ: a learning exercise based upon another post
var result = String.Join("-", inputData.Split('-').Skip(2));
or
var result = inputData.Split('-').Skip(2).FirstOrDefault(); //If the last part is NEE-DED then only NEE is returned.
As mentioned in the other SO thread it is not the fastest way of doing this.
If they are part of larger text:
(\w+-){2}(\w+)
If they are presented as whole lines, and you know you don't have other hyphens, you may also use:
[^-]*$
Another option, if you have each line as a string, is to use split (again, depending on whether or not you're expecting extra hyphens, you may omit the count parameter, or use LastIndexOf):
string[] tokens = line.Split("-".ToCharArray(), 3);
string s = tokens.Last();
This should work:
.*?-.*?-(.*)
This should do the trick:
([^\-]+)\-([^\-]+)\-(.*?)$
The regex pattern will be
(?<first>.*)?-(?<second>.*)?-(?<third>.*)?(\s|$)
Then you can read the named group "third" to get the text after the second hyphen.
Alternatively,
you can do a string.Split('-') and take the item at index 2 of the array.

help with a tag removal regex

I have strings in the form: "[user:fred][priority:3]Lorem ipsum dolor sit amet." where the area enclosed in square brackets is a tag (in the format [key:value]). I need to be able to remove a specific tag given its key with the following extension method:
public static void RemoveTagWithKey(this string message, string tagKey) {
if (message.ContainsTagWithKey(tagKey)) {
var regex = new Regex(@"\[" + tagKey + @":[^\]]");
message = regex.Replace(message , string.Empty);
}
}
public static bool ContainsTagWithKey(this string message, string tagKey) {
return message.Contains(string.Format("[{0}:", tagKey));
}
Only the tag with the specified key should be removed from the string. My regex doesn't work because it's daft. I need help to write it properly. Alternatively, an implementation without regex is welcome.
I know there are much more feature-rich tools out there, but I like the simplicity and cleanliness of Code Architects Regex Tester (aka YART: Yet Another Regex Tester). Shows groups and captures in a tree view, quite fast, very small, open source. It also generates code in C++, VB, and C# and can automatically escape or unescape regexes for these languages. I dump it in my VS tools folder (C:\Program Files\Microsoft Visual Studio 9.0\Common7\Tools) and set a menu item to it in the Tools menu with Tools > External Tools so I can fire it up quickly from inside VS.
Regexes can be really hard to write sometimes and I know it really helps to be able to test the regex and see the results as you go.
Another really popular (but not free) option is Regex Buddy.
If you want to do this without a Regex it isn't difficult. You're already searching for a specific tag key, so you can just search for "[" + tagKey, then search from there for the closing "]", and remove everything between those offsets. Something like...
int posStart = message.IndexOf("[" + tagKey + ":");
if(posStart >= 0)
{
int posEnd = message.IndexOf("]", posStart);
if(posEnd > posStart)
{
message = message.Remove(posStart, posEnd - posStart + 1); // + 1 so the closing "]" is removed too
}
}
Is that better than a Regex solution? Since you're only looking for a specific key I think it probably is, on the grounds of simplicity. I love Regexes but they're not always the clearest answer.
Edit: Another reason the IndexOf() solution could be seen as better is that it means there is only one rule for finding the start of the tag, whereas the original code uses a Contains() which searches for something like '[tag:' and then uses a regex which uses a slightly different expression to do the substitution / removal. In theory you could have text which matches one criterion but not the other.
Try this instead:
new Regex(@"\[" + tagKey + @":[^\]]+\]");
The changes are a + after the [^\]] class, so that it matches one or more characters that are not a closing bracket, and a trailing \] so that the closing bracket is removed as well.
I think this is the regex you're looking for:
string regex = @"\[" + tagKey + @":[^\]]+\]";
Also, you don't need to do a separate check to see if there are tags of that type. Just do a regex replace; if there are no matches, the original string is returned.
public static string RemoveTagWithKey(string message, string tagKey) {
string regex = @"\[" + tagKey + @":[^\]]+\]";
return Regex.Replace(message, regex, string.Empty);
}
You seem to be writing an extension method, but I wrote this as a static utility method to keep things simple.
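Two details are worth spelling out. Strings in .NET are immutable, so the void method in the question cannot change the caller's string; the result has to be returned. Also, a key containing regex metacharacters would alter the pattern unless it is escaped. Here is a hedged sketch combining both points, written as a plain function (the name `RemoveTag` is an assumption); inside a static class it could take `this string message` to become an extension method.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: removes every [tagKey:...] tag and returns the new string.
static string RemoveTag(string message, string tagKey)
{
    // Regex.Escape keeps keys such as "a.b" from being read as pattern syntax.
    // [^\]]* also allows an empty value, as in "[key:]".
    string pattern = @"\[" + Regex.Escape(tagKey) + @":[^\]]*\]";
    return Regex.Replace(message, pattern, string.Empty);
}

string cleaned = RemoveTag("[user:fred][priority:3]Lorem ipsum dolor sit amet.", "priority");
Console.WriteLine(cleaned); // [user:fred]Lorem ipsum dolor sit amet.
```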

Easiest way to convert a URL to a hyperlink in a C# string?

I am consuming the Twitter API and want to convert all URLs to hyperlinks.
What is the most effective way you've come up with to do this?
from
string myString = "This is my tweet check it out http://tinyurl.com/blah";
to
This is my tweet check it out <a href="http://tinyurl.com/blah">http://tinyurl.com/blah</a>
Regular expressions are probably your friend for this kind of task:
Regex r = new Regex(@"(https?://[^\s]+)");
myString = r.Replace(myString, "<a href=\"$1\">$1</a>");
The regular expression for matching URLs might need a bit of work.
I did this exact same thing with jQuery consuming the JSON API. Here is the linkify function:
String.prototype.linkify = function() {
return this.replace(/[A-Za-z]+:\/\/[A-Za-z0-9-_]+\.[A-Za-z0-9-_:%&\?\/.=]+/, function(m) {
return m.link(m);
});
};
This is actually an ugly problem. URLs can contain (and end with) punctuation, so it can be difficult to determine where a URL actually ends, when it's embedded in normal text. For example:
http://example.com/.
is a valid URL, but it could just as easily be the end of a sentence:
I buy all my witty T-shirts from http://example.com/.
You can't simply parse until a space is found, because then you'll keep the period as part of the URL. You also can't simply parse until a period or a space is found, because periods are extremely common in URLs.
Yes, regex is your friend here, but constructing the appropriate regex is the hard part.
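One common heuristic for the trailing-punctuation problem is to match up to the next whitespace and then hand any trailing sentence punctuation back to the surrounding text. The sketch below illustrates that idea; the helper name `Linkify` and the exact set of trimmed characters are assumptions, and a URL that genuinely ends in a period would be cut short.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: wraps http/https URLs in anchor tags.
static string Linkify(string text)
{
    return Regex.Replace(text, @"https?://\S+", m =>
    {
        // Heuristic: trailing punctuation belongs to the sentence, not the URL.
        string url = m.Value.TrimEnd('.', ',', ';', ':', '!', '?', ')');
        string tail = m.Value.Substring(url.Length);
        return "<a href=\"" + url + "\">" + url + "</a>" + tail;
    });
}

Console.WriteLine(Linkify("I buy all my witty T-shirts from http://example.com/."));
// I buy all my witty T-shirts from <a href="http://example.com/">http://example.com/</a>.
```

The sketch does not HTML-encode the URL; input from an untrusted source such as a public API should be encoded before it is written into a page.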
Check out this as well: Expanding URLs with Regex in .NET.
You can get more control over this by using a MatchEvaluator delegate with the regular expression.
Suppose I have this string:
find more on http://www.stackoverflow.com
Now try this code:
private void ModifyString()
{
string input = "find more on http://www.authorcode.com ";
Regex regx = new Regex(@"\b((http|https|ftp|mailto)://)?(www.)+[\w-]+(/[\w- ./?%&=]*)?");
string result = regx.Replace(input, new MatchEvaluator(ReplaceURl));
}
static string ReplaceURl(Match m)
{
string x = m.ToString();
x = "<a href=\"" + x + "\">" + x + "</a>";
return x;
}
/cheer for RedWolves
from: this.replace(/[A-Za-z]+:\/\/[A-Za-z0-9-_]+\.[A-Za-z0-9-_:%&\?\/.=]+/, function(m){...
see: /[A-Za-z]+:\/\/[A-Za-z0-9-_]+\.[A-Za-z0-9-_:%&\?\/.=]+/
There's the code for the addresses "anyprotocol"://"anysubdomain/domain"."anydomainextension and address",
and it's a perfect example for other uses of string manipulation. you can slice and dice at will with .replace and insert proper "a href"s where needed.
I used jQuery to change the attributes of these links to "target=_blank" easily in my content-loading logic even though the .link method doesn't let you customize them.
I personally love tacking on a custom method to the string object for on the fly string-filtering (the String.prototype.linkify declaration), but I'm not sure how that would play out in a large-scale environment where you'd have to organize 10+ custom linkify-like functions. I think you'd definitely have to do something else with your code structure at that point.
Maybe a vet will stumble along here and enlighten us.
