Remove substring from a list of strings

I have a list of strings that contain banned words. What's an efficient way of checking if a string contains any of the banned words and removing it from the string? At the moment, I have this:
cleaned = String.Join(" ", str.Split().Where(b => !bannedWords.Contains(b,
StringComparer.OrdinalIgnoreCase)).ToArray());
This works fine for single banned words, but not for phrases (i.e. more than one word). Any instance of a banned phrase should also be removed. An alternative I thought of trying is to use the List's Contains method, but that only returns a bool and not an index of the matching word. If I could get an index of the matching word, I could just use String.Replace(bannedWords[i],"");

A simple String.Replace will not work as it will remove word parts. If "sex" is a banned word and you have the word "sextet", which is not banned, you should keep it as is.
Using Regex you can find whole words and phrases in a text with
string text = "A sextet is a musical composition for six instruments or voices.";
string word = "sex";
var matches = Regex.Matches(text, @"(?<=\b)" + word + @"(?=\b)");
The matches collection will be empty in this case.
You can use the Regex.Replace method
foreach (string word in bannedWords) {
    text = Regex.Replace(text, @"(?<=\b)" + word + @"(?=\b)", "");
}
Note: I used the following Regex pattern
(?<=prefix)find(?=suffix)
where 'prefix' and 'suffix' are both \b, which denotes word beginnings and ends.
If your banned words or phrases can contain special characters, it would be safer to escape them with Regex.Escape(word).
Using @zmbq's idea you could create a Regex pattern once with
string pattern =
    @"(?<=\b)(" +
    String.Join(
        "|",
        bannedWords
            .Select(w => Regex.Escape(w))
            .ToArray()) +
    @")(?=\b)";
var regex = new Regex(pattern, RegexOptions.Compiled); // Not compiled by default, so request it explicitly
and then apply it repeatedly to different texts with
string result = regex.Replace(text, "");
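As a rough sketch of how the steps above might be packaged together: the class name BannedWordFilter, the Clean method, the case-insensitive matching, and the whitespace clean-up are illustrative assumptions, not part of the original answer. Longer entries are placed first in the alternation so that a phrase is tried before a shorter word it begins with.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: builds one regex for all banned words and phrases.
public sealed class BannedWordFilter
{
    private readonly Regex _regex;

    public BannedWordFilter(IEnumerable<string> bannedWords)
    {
        // Longer entries first, so "bad phrase" is tried before "bad".
        var alternatives = bannedWords
            .Where(w => !string.IsNullOrWhiteSpace(w))
            .OrderByDescending(w => w.Length)
            .Select(w => Regex.Escape(w));

        string pattern = @"\b(?:" + string.Join("|", alternatives) + @")\b";
        _regex = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);
    }

    public string Clean(string text)
    {
        string removed = _regex.Replace(text, "");
        // Assumption: collapse the extra spaces left behind by removed words.
        return Regex.Replace(removed, @"\s{2,}", " ").Trim();
    }
}
```

Usage would look like new BannedWordFilter(bannedWords).Clean(text). Note that \b only behaves as expected when banned entries start and end with word characters.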

It doesn't work because you have conflicting definitions.
When you want to look for phrases of more than one word, you cannot split on whitespace anymore. You'll have to fall back on String.IndexOf().

If it's performance you're after, I assume you're not worried about one-time setup time, but rather about continuous performance. So I'd build one huge regular expression containing all the banned expressions and make sure it's compiled - that's as a setup.
Then I'd try to match it against the text, and replace every match with a blank or whatever you want to replace it with.
The reason for this is that a big regular expression should compile into something comparable to the finite state automaton you would create by hand to handle this problem, so it should run quite nicely.

Why don't you iterate through the list of banned words and look up each of them in the string by using the method string.IndexOf?
For example, you can remove the banned words and phrases with the following piece of code:
myForbWords.ForEach(delegate(string item) {
    // Note: this removes only the first occurrence of each item
    int occ = str.IndexOf(item);
    if (occ > -1) str = str.Remove(occ, item.Length);
});
The type of myForbWords is List<string>.
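A possible extension of the IndexOf idea, sketched under a few assumptions: the helper name IndexOfRemover.RemoveAll is made up for illustration, matching is case-insensitive, and a hit only counts when it is not glued to a neighbouring letter or digit (so "sextet" survives when "sex" is banned). It removes every occurrence rather than only the first.

```csharp
using System;
using System.Collections.Generic;

public static class IndexOfRemover
{
    // Hypothetical helper: removes all whole-word occurrences of each banned entry.
    public static string RemoveAll(string text, IEnumerable<string> banned)
    {
        foreach (string item in banned)
        {
            if (string.IsNullOrEmpty(item)) continue; // an empty entry would loop forever

            int start = 0;
            while (true)
            {
                int pos = text.IndexOf(item, start, StringComparison.OrdinalIgnoreCase);
                if (pos < 0) break;

                int end = pos + item.Length;
                bool leftOk = pos == 0 || !char.IsLetterOrDigit(text[pos - 1]);
                bool rightOk = end == text.Length || !char.IsLetterOrDigit(text[end]);

                if (leftOk && rightOk)
                {
                    text = text.Remove(pos, item.Length);
                    start = pos;      // continue scanning from where the hit was
                }
                else
                {
                    start = pos + 1;  // part of a longer word: skip this hit
                }
            }
        }
        return text;
    }
}
```

The surrounding spaces are left in place, so the caller may still want to normalise whitespace afterwards.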

Related

Add wildcard to RegEx for phrases and text matching

I have a text file which consists of:
stemmed words (e.g. manipulat - stemmed from "manipulating"), and
stemmed phrases which are usually two words or more (e.g.
"acknowledg him regard the invest" - stemmed from "acknowledging him
regarding the investment").
Each word/phrase is on its own line. My C# code reads each line in this text file, then for each line, searches all rows in the DataTable for a match, i.e. if a word/phrase appears in any row of the DataTable, my system will flag the row.
For single word, it's easily done/matched using the algorithm I have. I can match "manipulat" to words like "manipulate", "manipulating", "manipulated" and "manipulation" if they appear in the DataTable rows.
But for phrases, my algorithm can only match exactly what it is. Here I mean if my phrase is "acknowledg him regard the invest", it will only search for the exact phrase, and it won't match/flag if "acknowledging him regarding the investment" exists in DataTable rows.
I have very little knowledge of both Regex and C#. I tried to modify the code below to use wildcards, but no luck so far. I would appreciate it if anyone could help with this. Thank you in advance.
string[] words = File.ReadAllLines(sourceDirTemp + comboBox_filename.SelectedItem.ToString() + ".txt");
var query = LoadComments().AsEnumerable().Where(r =>
    words.Any(wordOrPhrase => Regex.IsMatch(r.Field<string>("Column_name"), @"\b"
    + Regex.Escape(wordOrPhrase) + @"\b", RegexOptions.IgnoreCase)));

When comparing the lines with the stem words from your database using
Regex, you could extend the pattern in your code.
This will match 1 or more occurrences of any word character:
\w+
This will match 0 or more occurrences of any word character:
\w*
As Abbodanza already mentioned, this will match 0 or more occurrences of any character between a and z:
[a-z]*
EDIT:
If your algorithm works for single words you could split each phrase
string[] words = File.ReadAllLines(sourceDirTemp + comboBox_filename.SelectedItem.ToString() + ".txt");
foreach (var word in words)
{
    // moreOrOneWord.Length tells you whether the line is a phrase
    string[] moreOrOneWord = word.Split(' ');
    var query = LoadComments().AsEnumerable().Where(r =>
        moreOrOneWord.Any(wordOrPhrase => Regex.IsMatch(r.Field<string>("Column_name"), @"\b"
        + Regex.Escape(wordOrPhrase) + @"\b", RegexOptions.IgnoreCase)));
    // Do something with the query...
}
This should allow you to apply your algorithm to every single word in the text.
here you can find an example to start with regular expression.
and here is a List of RegEx elements that you can use.
Hope this can help

If you split the wordOrPhrase with space, and add \w* to match 0+ alphanumeric or underscore chars (or more specific pattern to only match letters like [\p{L}\p{M}]*) to each chunk, you could use
Regex.IsMatch(r.Field<string>("Column_name"),
    string.Join(" +", wordOrPhrase.Split()
        .Select(p => string.Format(@"\b{0}\w*\b", Regex.Escape(p)))),
    RegexOptions.IgnoreCase)
If your wordOrPhrase is acknowledg him regard the invest, the regex will be \backnowledg\w*\b +\bhim\w*\b +\bregard\w*\b +\bthe\w*\b +\binvest\w*\b and will find a match. See this IDEONE demo.
However, with this approach, himself will get matched with him (that would be turned into him\w*).
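The pattern construction described above could be wrapped up roughly as follows. This is only a sketch: the names StemPattern, Build, and IsMatch are invented for illustration, and it joins the chunks with \s+ (any whitespace) instead of the " +" used in the answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class StemPattern
{
    // Turns "acknowledg him" into \backnowledg\w*\b\s+\bhim\w*\b
    public static string Build(string stemmedPhrase)
    {
        var parts = stemmedPhrase
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries) // split on any whitespace
            .Select(p => @"\b" + Regex.Escape(p) + @"\w*\b");
        return string.Join(@"\s+", parts);
    }

    public static bool IsMatch(string text, string stemmedPhrase)
    {
        return Regex.IsMatch(text, Build(stemmedPhrase), RegexOptions.IgnoreCase);
    }
}
```

Because every chunk gets a trailing \w*, the stem him still matches himself, which is the caveat mentioned above.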

Regex replace all matching words that do not contain a certain string

How can I use regex to replace matching strings that do not include a specific string?
input string
Keepword mywordsecond mythirdword myfourthwordKeep
string to replace
word
exclude string
Keep
Desired output
Keepword mysecond mythird myfourthKeep

Will there ever be more than one word in a word? If there are more than one, do you want to replace all of them? If not, this should sort you out:
Regex r = new Regex(@"\b((?:(?!Keep|word)\w)*)word((?:(?!Keep)\w)*)\b");
s1 = r.Replace(s0, "$1$2");
to explain:
First, \b((?:(?!Keep|word)\w)*) captures whatever text precedes the first occurrence of word or Keep.
The next thing it sees must be word. If it sees Keep or the end of the string instead, the match attempt immediately fails.
Then ((?:(?!Keep)\w)*)\b captures the remainder of the text in order to ensure it doesn't contain Keep.
When faced with a problem like this, most users' first impulse is to match (in the sense of consuming) only the part of the string they're interested in, using lookarounds to establish the context. It's usually much easier to write the regex so that it always moves forward through the string as it matches. You capture the parts you want to retain so you can plug them back into the result string by means of group references ($1, $2, etc.).
Given that you're using C#, you could use the lookaround approach:
Regex r = new Regex(@"(?<!Keep\w*)word(?!\w*Keep)");
s1 = r.Replace(s0, "");
But please don't. There are very few regex flavors that support unrestricted lookbehinds like .NET does, and most problems don't work so neatly as this one anyway.

string str = "Keepword mywordsecond mythirdword myfourthwordKeep";
str = Regex.Replace(str, "(?<!Keep)word", "");
And I'm going to link you to a good regular expressions cheat sheet here

This works in notepad++:
(?<!Keep)word(?!Keep)
It uses a negative lookbehind and a negative lookahead.

You can use a negative look-behind assertion if you want to remove every "word" that is not preceded by "Keep":
String input = "Keepword mywordsecond mythirdword myfourthwordKeep";
String pattern = "(?<!Keep)word";
String output = Regex.Replace(input, pattern, "");
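For reference, a small sketch that generalises the look-behind pattern to arbitrary strings. The class and method names (KeepAwareReplace, Remove) are hypothetical. With the question's sample input it yields the desired output, in which only a preceding Keep protects an occurrence, so myfourthwordKeep becomes myfourthKeep.

```csharp
using System.Text.RegularExpressions;

public static class KeepAwareReplace
{
    // Removes every occurrence of target that is not immediately preceded by keepPrefix.
    public static string Remove(string input, string target, string keepPrefix)
    {
        string pattern = "(?<!" + Regex.Escape(keepPrefix) + ")" + Regex.Escape(target);
        return Regex.Replace(input, pattern, "");
    }
}
```

Regex.Escape is applied to both arguments so that characters such as . or ( in either string are treated literally.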

C# Trouble with Regex.Replace

Been scratching my head all day about this one!
Ok, so I have a string which contains the following:
?\"width=\"1\"height=\"1\"border=\"0\"style=\"display:none;\">');
I want to convert that string to the following:
?\"width=1height=1border=0style=\"display:none;\">');
I could theoretically just do a String.Replace on "\"1\"" etc. But this isn't really a viable option as the string could theoretically have any number within the expression.
I also thought about removing the string "\"", however there are other occurrences of this which I don't want to be replaced.
I have been attempting to use the Regex.Replace method as I believe this exists to solve problems along my lines. Here's what I've got:
chunkContents = Regex.Replace(chunkContents, "\".\"", ".");
Now that really messes things up (it replaces the correct elements, but with a full stop), but I think you can see what I am attempting to do with it. I am also worried that this will only work for single-digit numbers (\"1\" rather than \"11\"). That led me to think about using the "*" or "+" quantifier with ".", however I foresaw the problem of this picking up all of the text in between the desired characters (which are dotted all over the place), whereas I obviously only want to replace the ones with numeric characters in between them.
Hope I've explained that clearly enough, will be happy to provide any extra info if needed :)

Try this
var str = "?\"width=\"1\"height=\"1234\"border=\"0\"style=\"display:none;\">');";
str = Regex.Replace(str , "\"(\\d+)\"", "$1");
(\\d+) is a capturing group that looks for one or more digits and $1 references what the group captured.

This works
String input = @"?\""width=\""1\""height=\""1\""border=\""0\""style=\""display:none;\"">');";
//replace the entire match of the regex with only what's captured (the number)
String result = Regex.Replace(input, @"\\""(\d+)\\""", match => match.Result("$1"));
//control string for expected result
String shouldBe = @"?\""width=1height=1border=0style=\""display:none;\"">');";
//prints true
Console.WriteLine(result.Equals(shouldBe).ToString());
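The two answers differ in whether the string contains literal backslashes before the quotes. A hedged sketch that tolerates both cases is shown below; NumericUnquoter and Unquote are made-up names, and treating the backslash as optional is an assumption about the input rather than something stated in the question.

```csharp
using System.Text.RegularExpressions;

public static class NumericUnquoter
{
    // \\? makes the literal backslash before each quote optional;
    // (\d+) captures one or more digits between the quotes.
    private static readonly Regex QuotedNumber = new Regex(@"\\?""(\d+)\\?""");

    public static string Unquote(string s)
    {
        return QuotedNumber.Replace(s, "$1");
    }
}
```

Quoted values that are not purely numeric, such as the style attribute, are left untouched because \d+ cannot match them.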

How to check the repeated characters in a string

I am creating a program that filters words and checks whether each word exists in a dictionary. The problem is how to detect that a word has repeated characters.
For example:
string string1 = "sorrrrrrry";
that string does not exist in the dictionary but if you remove repeated r it will be "sorry".
I am using hunspell to check if the word exist in the dictionary. Any solution please? Thanks in advance

For your case what you can do is:
replace the repeated characters but 2 => "sorry"
look if the word exists on the dictionary
if not, replace the 2 repeated characters by 1 character => "sory" (if you have for example "caat")
look if the word exists on the dictionary
Use the regex (\w)\1+ (which matches repeated characters), replacing first with $1$1 (two copies of the matched character) and then with $1:
string input = "sorrrrrrry";
Regex regex = new Regex(@"(\w)\1+");
string replacement = "$1$1";
string res = regex.Replace(input, replacement);
Console.WriteLine(res);
//will output => sorry
replacement = "$1";
res = regex.Replace(input, replacement);
Console.WriteLine(res);
//will output => sory
Warning
This can give some results BUT it has some limitations and can produce unexpected results:
you need to handle all the combinations if more than two characters are repeated: if you have "soooorrrry" it will give you 1. "soorry" and then 2. "sory", so the algorithm will not work.
what to do with the case "gooood", is it "good" or "god" ?
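One way to address the first limitation is to generate every variant in which each run of repeated characters is reduced to either one or two copies, and then check each variant against the dictionary. The sketch below only produces the candidates; the names RepeatCollapser and Candidates are hypothetical, and the dictionary lookup (e.g. via hunspell) is left to the caller.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RepeatCollapser
{
    // Returns all spellings where each repeated run is kept as 1 or 2 characters.
    public static List<string> Candidates(string word)
    {
        var results = new List<string> { "" };

        // Each match is one run of the same character (length 1 or more).
        foreach (Match run in Regex.Matches(word, @"(.)\1*"))
        {
            string c = run.Groups[1].Value;
            var next = new List<string>();
            foreach (string prefix in results)
            {
                next.Add(prefix + c);                           // run reduced to one character
                if (run.Length > 1) next.Add(prefix + c + c);   // run reduced to two characters
            }
            results = next;
        }
        return results;
    }
}
```

For "soooorrrry" this yields sory, sorry, soory and soorry, so "sorry" is among the candidates. The ambiguity for "gooood" remains (both "good" and "god" are produced), and the number of candidates doubles with every repeated run.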

You can only guess, using some kind of fuzzy matching, which dictionary word was meant, and if more than one candidate is found, show a list.
You probably know how a smartphone keyboard tries to help you.
Doing this during typing, rather than afterwards, is more or less the proper way.
Doing it afterwards is also possible, but needs more effort.

You may want to look into storing the dictionary in Lucene.Net and using its loose matching capability to match the words.

Using Regex.Split to remove anything non numeric and splitting on -

I'm not sure why but for some reason The Regex Split method is going over my head. I'm trying to look through tutorials for what I need and can't seem to find anything.
I simply am reading an excel doc and want to format a string such as $145,000-$179,999 to give me two strings. 145000 and 179999. At the same time I'd like to prune a string such as '$180,000-Limit to simply 180000.
var loanLimits = Regex.Matches(Result.Rows[row + 2 + i][column].ToString(), @"\d+");
The above code seems to chop '$145,000-$179,999 up into 4 parts: 145, 000, 179, 999. Any ideas on how to achieve what I'm asking?

Regular expressions match exactly character by character (there's no knowledge of the concept of a "number" or a "word" in regular expressions - you have to define that yourself in your expression). The expression you are using, \d+, uses the character class \d, which means any digit 0-9 (and + means match one or more). So in the expression $145,000, notice that the part you are looking for is not just composed of digits; it also includes commas. So the regular expression finds every continuous group of characters that matches your regular expression, which are the four groups of numbers.
There are a couple of ways to approach the problem.
Include , in your regular expression, so (\d|,)+, which means match as many characters in a row that are either a digit or a comma. There will be two matches: 145,000 and 179,999, from which you can further remove the commas with myStr.Replace(",", ""). (DEMO)
Do as you say in the title, and remove all non-numeric characters. So you could use Regex.Replace with the expression [^\d-]+ - which means match anything that is not a digit or a hyphen - and then replace those with "". Then the result would be 145000-179999, which you can split with a simple non-regular-expression split, myStr.Split('-'), to get your two parts. (DEMO)
Note that for your second example ($180,000-Limit), you'll need an extra check to count the number of results returned from Match in the first example, and Split in the second example to determine whether there were two numbers in the range, or only a single number.
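A compact sketch of the first approach, including the count check mentioned above. The names LoanRange and Parse are illustrative, and the pattern \d[\d,]* (a digit followed by digits or commas) is a slight variation on (\d|,)+ that avoids matching a lone comma.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LoanRange
{
    // Returns the numbers found in the cell with thousands separators removed.
    public static string[] Parse(string cell)
    {
        return Regex.Matches(cell, @"\d[\d,]*")
            .Cast<Match>()
            .Select(m => m.Value.Replace(",", ""))
            .ToArray();
    }
}
```

The caller can then inspect the array length: two entries for a range such as $145,000-$179,999, and a single entry for an open-ended value such as '$180,000-Limit.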

You can try to treat each string separately by splitting it on - and extracting only the digits from each part:
ArrayList mystrings = new ArrayList();
List<string> myList = Result.Rows[row + 2 + i][column].ToString().Split('-').ToList();
foreach(var item in myList)
{
    string result = Regex.Replace(item, @"[^\d]", "");
mystrings.Add(result);
}

An alternative to using Regex is to use the built-in string and char methods in the .NET Framework. Assuming the input string will always have a single hyphen:
string input = "$145,000-$179,999";
var split = input.Split( '-' )
.Select( x => string.Join( "", x.Where( char.IsLetterOrDigit ) ) )
.ToList();
string first = split.First(); //145000
string second = split.Last(); //179999
first you split the string using the standard Split method
then you create a new string by selectively taking only Letters or Digits from each item in the collection: x.Where...
then you join the string using the standard Join method
finally, take the first and last item in the collection for your 2 strings.
