Replace Bad words using Regex - c#

I am trying to create a bad word filter method that I can call before every insert and update to check the string for any bad words and replace with "[Censored]".
I have an SQL table with has a list of bad words, I want to bring them back and add them to a List or string array and check through the string of text that has been passed in and if any bad words are found replace them and return a filtered string back.
I am using C# for this.

Please see this "clbuttic" (or for your case cl[Censored]ic) article before doing a string replace without considering word boundaries:
http://www.codinghorror.com/blog/2008/10/obscenity-filters-bad-idea-or-incredibly-intercoursing-bad-idea.html
Update
Obviously not foolproof (see article above - this approach is so easy to get around or produce false positives...) or optimized (the regular expressions should be cached and compiled), but the following will filter out whole words (no "clbuttics") and simple plurals of words:
const string CensoredText = "[Censored]";
const string PatternTemplate = #"\b({0})(s?)\b";
const RegexOptions Options = RegexOptions.IgnoreCase;
string[] badWords = new[] { "cranberrying", "chuffing", "ass" };
IEnumerable<Regex> badWordMatchers = badWords.
Select(x => new Regex(string.Format(PatternTemplate, x), Options));
string input = "I've had no cranberrying sleep for chuffing chuffings days -
the next door neighbour is playing classical music at full tilt!";
string output = badWordMatchers.
Aggregate(input, (current, matcher) => matcher.Replace(current, CensoredText));
Console.WriteLine(output);
Gives the output:
I've had no [Censored] sleep for [Censored] [Censored] days - the next door neighbour is playing classical music at full tilt!
Note that "classical" does not become "cl[Censored]ical", as whole words are matched with the regular expression.
Update 2
And to demonstrate a flavour of how this (and in general basic string\pattern matching techniques) can be easily subverted, see the following string:
"I've had no cranberryıng sleep for chuffıng chuffıngs days - the next door neighbour is playing classical music at full tilt!"
I have replaced the "i"'s with Turkish lower case undottted "ı"'s. Still looks pretty offensive!

Although I'm a big fan of Regex, I think it won't help you here. You should fetch your bad word into a string List or string Array and use System.String.Replace on your incoming message.
Maybe better, use System.String.Split and .Join methods:
string mayContainBadWords = "... bla bla ...";
string[] badWords = new string[]{"bad", "worse", "worst"};
string[] temp = string.Split(badWords, StringSplitOptions.RemoveEmptyEntries);
string cleanString = string.Join("[Censored]", temp);
In the sample, mayContainBadWords is the string you want to check; badWords is a string array, you load from your bad word sql table and cleanString is your result.

you can use string.replace() method or RegEx class

There is also a nice article about it which can e found here
With a little html-parsing skills, you can get a large list with swear words from noswear

Related

How to strip a string from the point a hyphen is found within the string C#

I'm currently trying to strip a string of data that is may contain the hyphen symbol.
E.g. Basic logic:
string stringin = "test - 9894"; OR Data could be == "test";
if (string contains a hyphen "-"){
Strip stringin;
output would be "test" deleting from the hyphen.
}
Console.WriteLine(stringin);
The current C# code i'm trying to get to work is shown below:
string Details = "hsh4a - 8989";
var regexItem = new Regex("^[^-]*-?[^-]*$");
string stringin;
stringin = Details.ToString();
if (regexItem.IsMatch(stringin)) {
stringin = stringin.Substring(0, stringin.IndexOf("-") - 1); //Strip from the ending chars and - once - is hit.
}
Details = stringin;
Console.WriteLine(Details);
But pulls in an Error when the string does not contain any hyphen's.
How about just doing this?
stringin.Split('-')[0].Trim();
You could even specify the maximum number of substrings using overloaded Split constructor.
stringin.Split('-', 1)[0].Trim();
Your regex is asking for "zero or one repetition of -", which means that it matches even if your input does NOT contain a hyphen. Thereafter you do this
stringin.Substring(0, stringin.IndexOf("-") - 1)
Which gives an index out of range exception (There is no hyphen to find).
Make a simple change to your regex and it works with or without - ask for "one or more hyphens":
var regexItem = new Regex("^[^-]*-+[^-]*$");
here -------------------------^
It seems that you want the (sub)string starting from the dash ('-') if original one contains '-' or the original string if doesn't have dash.
If it's your case:
String Details = "hsh4a - 8989";
Details = Details.Substring(Details.IndexOf('-') + 1);
I wouldn't use regex for this case if I were you, it makes the solution much more complex than it can be.
For string I am sure will have no more than a couple of dashes I would use this code, because it is one liner and very simple:
string str= entryString.Split(new [] {'-'}, StringSplitOptions.RemoveEmptyEntries)[0];
If you know that a string might contain high amount of dashes, it is not recommended to use this approach - it will create high amount of different strings, although you are looking just for the first one. So, the solution would look like something like this code:
int firstDashIndex = entryString.IndexOf("-");
string str = firstDashIndex > -1? entryString.Substring(0, firstDashIndex) : entryString;
you don't need a regex for this. A simple IndexOf function will give you the index of the hyphen, then you can clean it up from there.
This is also a great place to start writing unit tests as well. They are very good for stuff like this.
Here's what the code could look like :
string inputString = "ho-something";
string outPutString = inputString;
var hyphenIndex = inputString.IndexOf('-');
if (hyphenIndex > -1)
{
outPutString = inputString.Substring(0, hyphenIndex);
}
return outPutString;

Find index of first Char(Letter) in string

I have a mental block and can't seem to figure this out, sure its pretty easy 0_o
I have the following string: "5555S1"
String can contain any number of digits, followed by a Letter(A-Z), followed by numbers again.
How do I get the index of the Letter(S), so that I can substring so get everything following the Letter
Ie: 5555S1
Should return S1
Cheers
You could also check if the integer representation of the character is >= 65 && <=90.
Simple Python:
test = '5555Z187456764587368457638'
for i in range(0,len(test)):
if test[i].isalpha():
break
print test[i:]
Yields: Z187456764587368457638
Given that you didn't say what language your using I'm going to pick the one I want to answer in - c#
String.Index see http://msdn.microsoft.com/en-us/library/system.string.indexof.aspx for more
for good measure here it is in java string.indexOf
One way could be to loop through the string untill you find a letter.
while(! isAlpha(s[i])
i++;
or something should work.
This doesn't answer your question but it does solve your problem.
(Although you can use it to work out the index)
Your problem is a good candidate for Regular Expressions (regex)
Here is one I prepared earlier:
String code = "1234A0987";
//timeout optional but needed for security (so bad guys dont overload your server)
TimeSpan timeout = TimeSpan.FromMilliseconds(150);
//Magic here:
//Pattern == (Block of 1 or more numbers)(block of 1 or more not numbers)(Block of 1 or more numbers)
String regexPattern = #"^(?<firstNum>\d+)(?<notNumber>\D+)(?<SecondNum>\d+)?";
Regex r = new Regex(regexPattern, RegexOptions.None, timeout);
Match m = r.Match(code);
if (m.Success)//We got a match!
{
Console.WriteLine ("SecondNumber: {0}",r.Match(code).Result("${SecondNum}"));
Console.WriteLine("All data (formatted): {0}",r.Match(code).Result("${firstNum}-${notNumber}-${SecondNum}"));
Console.WriteLine("Offset length (not that you need it now): {0}", r.Match(code).Result("${firstNum}").Length);
}
Output:
SecondNumber: 0987
All data (formatted): 1234-A-0987
Offset length (not that you need it now): 4
Further info on this example here.
So there you go you can even work out what that index was.
Regex cheat sheet

How do I parse out Author information using a Regex in C#?

I have the following text:
BATTLE HYMN OF THE TIGER MOTHER, by Amy Chua. (Penguin
Press, $25.95.) A Chinese-American mother makes a case for strict
and demanding parenting
I'd like to use a regex to parse out:
Title
Author
Publisher
MSRP (Retail Price)
Description
How do I write a regex to do this in C#?
Just saw answers were allowed again. This is my recommended regex:
^(?<title>[\w\s]*), by (?<author>[\w\s]*)\. \((?<publisher>[\w\s]*), (?<msrp>.*)\.\) (?<description>.*)$
It will give you a named capture for the fields above and can be used in C# like this:
private void Main()
{
string input = "BATTLE HYMN OF THE TIGER MOTHER, by Amy Chua. (Penguin Press, $25.95.) A Chinese-American mother makes a case for strict and demanding parenting";
string pattern = #"^(?<title>[\w\s]*), by (?<author>[\w\s]*)\. \((?<publisher>[\w\s]*), (?<msrp>.*)\.\) (?<description>.*)$";
MatchCollection myMatchCollection = Regex.Matches(input, pattern);
foreach (Match myMatch in myMatchCollection)
{
var title = myMatch.Groups["title"];
var author = myMatch.Groups["author"];
var publisher = myMatch.Groups["publisher"];
var msrp = myMatch.Groups["msrp"];
var description = myMatch.Groups["description"];
}
}
I think it might be simpler to:
Split on "(" or ")"
Split on "by" for the left part
Split on ", " for the middle part
right part is your description
Using the string.Split() method.
This all of course depends on how reliable the pattern is--as the above commenters mention.
This does it:
^([ \w]+), by ([ \w]+). \(([ \w]+), ([$.\d]+)\) ([ \w-]+)$
You can add named groups to pull them out by name or just the matches by index. However this will most likely be incredibly brittle unless your source data is very strict.
I've also only done it for this one example, the description has a - in it, which is an example of a special character in the names, so you might want to make sure those are handled as you expect.

.NET String parsing performance improvement - Possible Code Smell

The code below is designed to take a string in and remove any of a set of arbitrary words that are considered non-essential to a search phrase.
I didn't write the code, but need to incorporate it into something else. It works, and that's good, but it just feels wrong to me. However, I can't seem to get my head outside the box that this method has created to think of another approach.
Maybe I'm just making it more complicated than it needs to be, but I feel like this might be cleaner with a different technique, perhaps by using LINQ.
I would welcome any suggestions; including the suggestion that I'm over thinking it and that the existing code is perfectly clear, concise and performant.
So, here's the code:
private string RemoveNonEssentialWords(string phrase)
{
//This array is being created manually for demo purposes. In production code it's passed in from elsewhere.
string[] nonessentials = {"left", "right", "acute", "chronic", "excessive", "extensive",
"upper", "lower", "complete", "partial", "subacute", "severe",
"moderate", "total", "small", "large", "minor", "multiple", "early",
"major", "bilateral", "progressive"};
int index = -1;
for (int i = 0; i < nonessentials.Length; i++)
{
index = phrase.ToLower().IndexOf(nonessentials[i]);
while (index >= 0)
{
phrase = phrase.Remove(index, nonessentials[i].Length);
phrase = phrase.Trim().Replace(" ", " ");
index = phrase.IndexOf(nonessentials[i]);
}
}
return phrase;
}
Thanks in advance for your help.
Cheers,
Steve
This appears to be an algorithm for removing stop words from a search phrase.
Here's one thought: If this is in fact being used for a search, do you need the resulting phrase to be a perfect representation of the original (with all original whitespace intact), but with stop words removed, or can it be "close enough" so that the results are still effectively the same?
One approach would be to tokenize the phrase (using the approach of your choice - could be a regex, I'll use a simple split) and then reassemble it with the stop words removed. Example:
public static string RemoveStopWords(string phrase, IEnumerable<string> stop)
{
var tokens = Tokenize(phrase);
var filteredTokens = tokens.Where(s => !stop.Contains(s));
return string.Join(" ", filteredTokens.ToArray());
}
public static IEnumerable<string> Tokenize(string phrase)
{
return string.Split(phrase, ' ');
// Or use a regex, such as:
// return Regex.Split(phrase, #"\W+");
}
This won't give you exactly the same result, but I'll bet that it's close enough and it will definitely run a lot more efficiently. Actual search engines use an approach similar to this, since everything is indexed and searched at the word level, not the character level.
I guess your code is not doing what you want it to do anyway. "moderated" would be converted to "d" if I'm right. To get a good solution you have to specify your requirements a bit more detailed. I would probably use Replace or regular expressions.
I would use a regular expression (created inside the function) for this task. I think it would be capable of doing all the processing at once without having to make multiple passes through the string or having to create multiple intermediate strings.
private string RemoveNonEssentialWords(string phrase)
{
return Regex.Replace(phrase, // input
#"\b(" + String.Join("|", nonessentials) + #")\b", // pattern
"", // replacement
RegexOptions.IgnoreCase)
.Replace(" ", " ");
}
The \b at the beginning and end of the pattern makes sure that the match is on a boundary between alphanumeric and non-alphanumeric characters. In other words, it will not match just part of the word, like your sample code does.
Yeah, that smells.
I like little state machines for parsing, they can be self-contained inside a method using lists of delegates, looping through the characters in the input and sending each one through the state functions (which I have return the next state function based on the examined character).
For performance I would flush out whole words to a string builder after I've hit a separating character and checked the word against the list (might use a hash set for that)
I would create A Hash table of Removed words parse each word if in the hash remove it only one time through the array and I believe that creating a has table is O(n).
How does this look?
foreach (string nonEssent in nonessentials)
{
phrase.Replace(nonEssent, String.Empty);
}
phrase.Replace(" ", " ");
If you want to go the Regex route, you could do it like this. If you're going for speed it's worth a try and you can compare/contrast with other methods:
Start by creating a Regex from the array input. Something like:
var regexString = "\\b(" + string.Join("|", nonessentials) + ")\\b";
That will result in something like:
\b(left|right|chronic)\b
Then create a Regex object to do the find/replace:
System.Text.RegularExpressions.Regex regex = new System.Text.RegularExpressions.Regex(regexString, System.Text.RegularExpressions.RegexOptions.IgnoreCase);
Then you can just do a Replace like so:
string fixedPhrase = regex.Replace(phrase, "");

Highlight a list of words using a regular expression in c#

I have some site content that contains abbreviations. I have a list of recognised abbreviations for the site, along with their explanations. I want to create a regular expression which will allow me to replace all of the recognised abbreviations found in the content with some markup.
For example:
content: This is just a little test of the memb to see if it gets picked up.
Deb of course should also be caught here.
abbreviations: memb = Member; deb = Debut;
result: This is just a little test of the [a title="Member"]memb[/a] to see if it gets picked up.
[a title="Debut"]Deb[/a] of course should also be caught here.
(This is just example markup for simplicity).
Thanks.
EDIT:
CraigD's answer is nearly there, but there are issues. I only want to match whole words. I also want to keep the correct capitalisation of each word replaced, so that deb is still deb, and Deb is still Deb as per the original text. For example, this input:
This is just a little test of the memb.
And another memb, but not amemba.
Deb of course should also be caught here.deb!
First you would need to Regex.Escape() all the input strings.
Then you can look for them in the string, and iteratively replace them by the markup you have in mind:
string abbr = "memb";
string word = "Member";
string pattern = String.Format("\b{0}\b", Regex.Escape(abbr));
string substitue = String.Format("[a title=\"{0}\"]{1}[/a]", word, abbr);
string output = Regex.Replace(input, pattern, substitue);
EDIT: I asked if a simple String.Replace() wouldn't be enough - but I can see why regex is desirable: you can use it to enforce "whole word" replacements only by making a pattern that uses word boundary anchors.
You can go as far as building a single pattern from all your escaped input strings, like this:
\b(?:{abbr_1}|{abbr_2}|{abbr_3}|{abbr_n})\b
and then using a match evaluator to find the right replacement. This way you can avoid iterating the input string more than once.
Not sure how well this will scale to a big word list, but I think it should give the output you want (although in your question the 'result' seems identical to 'content')?
Anyway, let me know if this is what you're after
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
var input = #"This is just a little test of the memb to see if it gets picked up.
Deb of course should also be caught here.";
var dictionary = new Dictionary<string,string>
{
{"memb", "Member"}
,{"deb","Debut"}
};
var regex = "(" + String.Join(")|(", dictionary.Keys.ToArray()) + ")";
foreach (Match metamatch in Regex.Matches(input
, regex /*#"(memb)|(deb)"*/
, RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture))
{
input = input.Replace(metamatch.Value, dictionary[metamatch.Value.ToLower()]);
}
Console.Write (input);
Console.ReadLine();
}
}
}
For anyone interested, here is my final solution. It is for a .NET user control. It uses a single pattern with a match evaluator, as suggested by Tomalak, so there is no foreach loop. It's an elegant solution, and it gives me the correct output for the sample input while preserving correct casing for matched strings.
public partial class Abbreviations : System.Web.UI.UserControl
{
private Dictionary<String, String> dictionary = DataHelper.GetAbbreviations();
protected void Page_Load(object sender, EventArgs e)
{
string input = "This is just a little test of the memb. And another memb, but not amemba to see if it gets picked up. Deb of course should also be caught here.deb!";
var regex = "\\b(?:" + String.Join("|", dictionary.Keys.ToArray()) + ")\\b";
MatchEvaluator myEvaluator = new MatchEvaluator(GetExplanationMarkup);
input = Regex.Replace(input, regex, myEvaluator, RegexOptions.IgnoreCase);
litContent.Text = input;
}
private string GetExplanationMarkup(Match m)
{
return string.Format("<b title='{0}'>{1}</b>", dictionary[m.Value.ToLower()], m.Value);
}
}
The output looks like this (below). Note that it only matches full words, and that the casing is preserved from the original string:
This is just a little test of the <b title='Member'>memb</b>. And another <b title='Member'>memb</b>, but not amemba to see if it gets picked up. <b title='Debut'>Deb</b> of course should also be caught here.<b title='Debut'>deb</b>!
I doubt it will perform better than just doing normal string.replace, so if performance is critical measure (refactoring a bit to use a compiled regex). You can do the regex version as:
var abbrsWithPipes = "(abbr1|abbr2)";
var regex = new Regex(abbrsWithPipes);
return regex.Replace(html, m => GetReplaceForAbbr(m.Value));
You need to implement GetReplaceForAbbr, which receives the specific abbr being matched.
I'm doing pretty exactly what you're looking for in my application and this works for me:
the parameter str is your content:
public static string GetGlossaryString(string str)
{
List<string> glossaryWords = GetGlossaryItems();//this collection would contain your abbreviations; you could just make it a Dictionary so you can have the abbreviation-full term pairs and use them in the loop below
str = string.Format(" {0} ", str);//quick and dirty way to also search the first and last word in the content.
foreach (string word in glossaryWords)
str = Regex.Replace(str, "([\\W])(" + word + ")([\\W])", "$1<span class='glossaryItem'>$2</span>$3", RegexOptions.IgnoreCase);
return str.Trim();
}

Categories