i'm creating a blacklist of keywords which I want to check for in text files, however, i'm having trouble finding any regex documentation which will help me with the following issue.
I have a set of blacklisted keywords:
welcome, goodbye, join us
I want to check some text files for any matches. I'm using the following regex to match exact words and also the pluralized version.
string.Format(#"\b{0}s*\b", keyword)
However, I've run into an issue matching keywords with two words and a any character in between. The above regex matches 'join us', but I need to match 'join#us' or 'join_us' for example as well.
Any help would be greatly appreciated.
I thing, that the "any character in between" may cause you a lot of troubles. For example let's consider this:
We want to find "my elf"... but you probably don't want to match "myself".
Anyway. If this is OK with you replace space character with dot in the keyword using string.Replace.
. in regex will match any character.
If you are new to regexes, check this useful cheat sheet: http://www.mikesdotnetting.com/article/46/c-regular-expressions-cheat-sheet
To solve the issue with "myself" and "my elf", use something more careful than . in the regex. For example [^a-zA-Z] which will match anything except letters from a to z and A to Z, or maybe \W, which will match non-word character, which means anything except a-zA-Z0-9_, so it is equivalent to [^a-zA-Z0-9_].
Also be careful about plural forms like city - cities and all the irregular ones.
You could try something like this (I left only the {0} part of the regex):
var relevantChars = new char[]{',', '#'}; // add here anything you like
string.Format(#"{0}", keyword.Replace(" ", "(" + string.Join("|", relevantChars ) + ")"));
If you're set on using pluralization, you will have to use the PluralizationService (see this answer for more details).
And seeing that you're using a string.Format, I assume you're looping your backlist array.
So why not do it all in a neat method?
public static string GetBlacklistRegexString(string[] blacklist)
{
//It seems that this service only support engligh natively, to check later
var ps = PluralizationService.CreateService(CultureInfo.GetCultureInfo("en"));
//Using a StringBuilder for ease of use and performance,
//even though it's not easy on the eye :p
StringBuilder sb = new StringBuilder().Append(#"\b(");
//We're just going to make a unique regex with all the words
//and their plurals in a list, so we're looping here
foreach (var word in blacklist)
{
//Using a dot wasn't careful indeed... Feel free to replace
//"\W" with anything that does it for you. It will match
//any non-alphanumerical character
var regexPlural = ps.Pluralize(word).Replace(" ", #"\W");
var regexWord = word.Replace(" ", #"\W");
sb.Append(regexWord).Append('|').Append(regexPlural).Append('|');
}
sb.Remove(sb.Length - 1, 1); //removing the last '|'
sb.Append(#")\b");
return sb.ToString();
}
The usage is nothing surprising if you're already using regular expressions in .NET:
static void Main(string[] args)
{
string[] blacklist = {"Goodbye","Welcome","join us"};
string input = "Welcome, come join us at dummywebsite.com for fun and games, goodbye!";
//I assume that you want it case insensitive
Regex blacklistRegex = new Regex(GetBlacklistRegexString(blacklist), RegexOptions.IgnoreCase);
foreach (Match match in blacklistRegex.Matches(input))
{
Console.WriteLine(match);
}
Console.ReadLine();
}
We get written on the console the expected output:
Welcome
join us
goodbye
Edit: still have a problem (working on it later), if "man" is in your keywords, it will match the "men" in "women"... Weirdly I don't get this behaviour on regexhero.
Edit 2: duh, of course if I don't group the words with parenthesis, the word boundaries are just applied to the first and last one... Corrected.
Related
I currently have a regex that checks if a US State is spelled correctly
var r = new Regex(string.Format(#"\b(?:{0})\b", pattern), RegexOptions.IgnoreCase)
pattern is a pipe delimited string containing all US states.
It was working as intended today until one of the states was spelled like "Florida.." I would have liked it picked up the fact there was a fullstop character.
I found this regex that will only match letters.
^[a-zA-Z]+
How do I combine this with my current Regex or is it not possible?
I tried some variations of this but it didn't work
var r = new Regex(string.Format(#"\b^[a-zA-Z]+(?:{0})\b", pattern), RegexOptions.IgnoreCase);
EDIT: Florida.. was in my input string. My pattern string hasn't changed at all. Apologies for not being clearer.
It seems you need start of string (^) and end of string ($) anchors:
var r = new Regex(string.Format(#"^(?:{0})$", pattern), RegexOptions.IgnoreCase);
The regex above would match any string comprising a name of a state only.
You should make a replacement of the pattern variable to escape the regex special characters. One of them is the . character. Something similar to pattern.Replace(".", #"\.") but doing all the especial characters.
I believe you can't merge both patterns into one, so you would have to perform two diferent regex operations, one to split the states into a list, and a subsequent one for the validation of each item within it.
I'd rather go for something "simpler" such as
var states = input.Split('|').Select(s => new string(s.Where(char.IsLetter).ToArray()))
.Where(s => !string.IsNullOrWhiteSpace(s));
Basically don't use a regex here.
List<string> values = new List<string>() {"florida", etc.};
string input;
//is input in values, ignore case and look for any value that includes the input value
bool correct = values.Any(a =>
input.IndexOf(a, StringComparison.CurrentCultureIgnoreCase) >= 0);
This will be considerably more efficient than a regex based option. This should match florida, Florida and Florida..., etc.
Don't search for characters directly, tell regex to consume all which are not targeted specific characters such as [^\|.]+. It uses the set [ ] with the not ^ indicator says consume anything which is not a literal | or .. Hence it consumes just the text needed. Such as on
Colorado|Florida..|New Mexico
returns 3 matches of Colorado Florida and New Mexico
How can I use regular expressions to find if the string matches a pattern like [sometextornumber] is a [sometextornumber].
For instance, if the input is This is a test, the output should be this and test.
I was thinking something like ([a-zA-Z0-9]) is a([a-zA-Z0-9]) but looks like I am way off the correct path.
Your question is geared towards grabbing the first and last word of a sentence. If this is all you're going to be interested in, this pattern will suffice:
"^(\\w+)|(\\w+)$"
Pattern breakdown:
^ indicates the beginning of a line
^(\\w+) capture group for a word at the beginning of the line. This is equivalent to [a-zA-Z0-9]+, where the + says you want a one or more letters and numbers.
| acts as an OR operator in Regex
$ indicates the end of a line
(\\w+)$ capture group for a word at the end of the line. This is equivalent to [a-zA-Z0-9]+, where the + says you want a one or more letters and numbers.
This pattern allows you to ignore what's in between the first and last word, so it doesn't care about "is a", and give you one capture group to pull from.
Usage:
string data = "This is going to be a test";
Match m = Regex.Match(data, "^(\\w+)|(\\w+)$");
while (m.Success)
{
Console.WriteLine(m.Groups[0]);
m = m.NextMatch();
}
Results:
This
test
If you're really only interested in the first and last word of a sentence, you also don't need to bother with Regex. Just split the sentence by a space and grab the first and last element of the array.
string[] dataPieces = data.Split(' ');
Console.WriteLine(dataPieces[0]);
Console.WriteLine(dataPieces[dataPieces.Length - 1]);
And the results are the same.
References:
https://msdn.microsoft.com/en-us/library/hs600312(v=vs.110).aspx
https://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx
Try this:
([a-zA-Z0-9])+ is a ([a-zA-Z0-9])+
Edit:
You need a space after the a since it is another word. Without the + it will only match from last letter of the first word till the first letter of the last word. The + will match 1 or more of whatever is in the (), so in this case the whole word.
If you're looking to match a specific pattern such as "This" or "Test" you can simply do a case insensitive string compare.
From your question, I'm not sure that you necessarily need a regular expression here.
Here is a quick LINQpad:
var r = new Regex("(.*) is a (.*)");
var match = r.Match("This is a test");
match.Groups.OfType<Group>().Skip(1).Select(g=>g.Value).Dump();
That outputs:
IEnumerable<String> (2 items)
This
test
How can I use regex to replace matching strings that do not include a specific string?
input string
Keepword mywordsecond mythirdword myfourthwordKeep
string to replace
word
exclude string
Keep
Desired out put
Keepword mysecond mythird myfourthKeep
Will there ever be more than one word in a word? If there are more than one, do you want to replace all of them? If not, this should sort you out:
Regex r = new Regex(#"\b((?:(?!Keep|word)\w)*)word((?:(?!Keep)\w)*)\b");
s1 = r.Replace(s0, "$1$2");
to explain:
First, \b((?:(?!Keep|word)\w)*) captures whatever text precedes the first occurrence of word or Keep.
The next thing it sees must be word, If it sees Keep or the end of the string instead, the match attempt immediately fails.
Then ((?:(?!Keep)\w)*)\b captures the remainder of the text in order to ensure it doesn't contain Keep.
When faced with a problem like this, most users' first impulse is to match (in the sense of consuming) only the part of the string they're interested in, using lookarounds to establish the context. It's usually much easier to write the regex so that it always moves forward through the string as it matches. You capture the parts you want to retain so you can plug them back into the result string by means of group references ($1, $2, etc.).
Given that you're using C#, you could use the lookaround approach:
Regex r = new Regex(#"(?<!Keep\w*)word(?!\w*Keep)");
s1 = r.Replace(s0, "");
But please don't. There are very few regex flavors that support unrestricted lookbehinds like .NET does, and most problems don't work so neatly as this one anyway.
string str = "Keepword mywordsecond mythirdword myfourthwordKeep";
str = Regex.Replace(str, "(?<!Keep)word", "");
And I'm going to link you to a one of good Regular Expressions Cheat sheet here
This works in notepad++:
(?<!Keep)word(?!Keep)
It uses "look ahead".
You can use negative look-behind assertion if you want to remove all "word" that are not proceeded by "Keep":
String input = "Keepword mywordsecond mythirdword myfourthwordKeep";
String pattern = "(?<!Keep)word";
String output = Regex.Replace(input, pattern, "");
I have a list of strings that contain banned words. What's an efficient way of checking if a string contains any of the banned words and removing it from the string? At the moment, I have this:
cleaned = String.Join(" ", str.Split().Where(b => !bannedWords.Contains(b,
StringComparer.OrdinalIgnoreCase)).ToArray());
This works fine for single banned words, but not for phrases (e.g. more than one word). Any instance of more than one word should also be removed. An alternative I thought of trying is to use the List's Contains method, but that only returns a bool and not an index of the matching word. If I could get an index of the matching word, I could just use String.Replace(bannedWords[i],"");
A simple String.Replace will not work as it will remove word parts. If "sex" is a banned word and you have the word "sextet", which is not banned, you should keep it as is.
Using Regex you can find whole words and phrases in a text with
string text = "A sextet is a musical composition for six instruments or voices.".
string word = "sex";
var matches = Regex.Matches(text, #"(?<=\b)" + word + #"(?=\b)");
The matches collection will be empty in this case.
You can use the Regex.Replace method
foreach (string word in bannedWords) {
text = Regex.Replace(text, #"(?<=\b)" + word + #"(?=\b)", "")
}
Note: I used the following Regex pattern
(?<=prefix)find(?=suffix)
where 'prefix' and 'suffix' are both \b, which denotes word beginnings and ends.
If your banned words or phrases can contain special characters, it would be safer to escape them with Regex.Escape(word).
Using #zmbq's idea you could create a Regex pattern once with
string pattern =
#"(?<=\b)(" +
String.Join(
"|",
bannedWords
.Select(w => Regex.Escape(w))
.ToArray()) +
#")(?=\b)";
var regex = new Regex(pattern); // Is compiled by default
and then apply it repeatedly to different texts with
string result = regex.Replace(text, "");
It doesn't work because you have conflicting definitions.
When you want to look for sub-sentences like more than one word you cannot split on whitespace anymore. You'll have to fall back on String.IndexOf()
If it's performance you're after, I assume you're not worried about one-time setup time, but rather about continuous performance. So I'd build one huge regular expression containing all the banned expressions and make sure it's compiled - that's as a setup.
Then I'd try to match it against the text, and replace every match with a blank or whatever you want to replace it with.
The reason for this, is that a big regular expression should compile into something comparable to the finite state automaton you would create by hand to handle this problem, so it should run quite nicely.
Why don't you iterate through the list of banned words and look up each of them in the string by using the method string.IndexOf.
For example, you can remove the banned words and phrases with the following piece of code:
myForbWords.ForEach(delegate(string item) {
int occ = str.IndexOf(item);
if(occ > -1) str = str.Remove(occ, item.Length);
});
Type of myForbWords is List<string>.
I need a regex pattern which will accommodate for the following.
I get a response from a UDP server, it's a very long string and each word is separated by \, for example:
\g79g97\g879o\wot87gord\player_0\name0\g6868o\g78og89\g79g79\player_1\name1\gyuvui\yivyil\player_2\name2\g7g87\g67og9o\v78v9i7
I need the strings after \player_n\, so in the above example I would need name0, name1 and name3,
I know this is the second regex question of the day but I have the book (Mastering Regular Expressions) on order! Thank you.
UPDATE. elusive's regex pattern will suffice, and I can add the match(0) to a textbox. However, what if I want to add all the matches to the text box ?
textBox1.Text += match.Captures[0].ToString(); //this works fine.
How do I add "all" match.captures to the text box? :s sorry for being so lame, this Regex class is brand new to me .
Try this one:
\\player_\d+\\([^\\]+)
i think that this test sample can help you
string inp = #"\g79g97\g879o\wot87gord\player_0\name0\g6868o\g78og89\g79g79\player_1\name1\gyuvui\yivyil\player_2\name2\g7g87\g67og9o\v78v9i7";
string rex = #"[\w]*[\\]player_[0-9]+[\\](?<name>[A-Za-z0-9]*)\b";
Regex re = new Regex(rex);
Match mat = re.Match(inp);
for (Match m = re.Match(inp); m.Success; m = m.NextMatch())
{
Console.WriteLine(m.Groups["name"]);
}
you can take the name of the player from the m.Groups["name"]
To get only the player name, you could use:
(?<=\\player_\d+\\)[^\\]+
This (?<=\\player_\d+\\) is something called a positive look-behind. It makes sure that the actual match [^\\]+ is preceded by the expression in the parentheses.
In this case, it's even specific to only a few regex engines (.NET being among them, luckily), in that it contains a variable length expression (due to \d+). Most regex engines only support fixed-length look-behind.
In any case, look-behind is not necessarily the best approach to this problem, match groups are simpler easier to read.