How to check the repeated characters in a string - c#

I am creating a program that filters and check if the word is existing in a dictionary. The problem is how to know if the word has repeated characters.
For example:
string string1 = "sorrrrrrry";
that string does not exist in the dictionary but if you remove repeated r it will be "sorry".
I am using hunspell to check if the word exist in the dictionary. Any solution please? Thanks in advance

For your case what you can do is:
replace the repeated characters but 2 => "sorry"
look if the word exists on the dictionary
if not, replace the 2 repeated characters by 1 character => "sory" (if you have for example "caat")
look if the word exists on the dictionary
Using the regex (\w)\1+ (matches repeated characters) and replacing the first time by $1$1 (2 repeated matched characters) and the by $1
string input = "sorrrrrrry";
Regex regex = new Regex(#"(\w)\1+");
string replacement = "$1$1";
string res = regex.Replace(input, replacement);
Console.WriteLine(res);
//will output => sorry
replacement = "$1";
res = regex.Replace(input, replacement);
Console.WriteLine(res);
//will output => sory
Warning
This can give some results BUT it has some limitations and can produce unexpected results:
you need to handle all the combinations if more than two characters are repeated: if you have "soooorrrry" it will give you 1. "soorry" and then 2. "sory", so the algorithm will not work.
what to do with the case "gooood", is it "good" or "god" ?

You only can try to guess by several fuzzy logic methods which word is the one, wich could match SOME in the dictionary and, if more than one is found, show a list.
Perhaps You know, how a smartphone keyboard tries to help You.
This way is more or less the proper one ( during typing ) not after.
But after is also possible, but needs more effort.

You may want to look into storing the dictionary in Lucene.Net and using its loose matching capability to match the words.

Related

Regex replace all matching words that do not contain a certain string

How can I use regex to replace matching strings that do not include a specific string?
input string
Keepword mywordsecond mythirdword myfourthwordKeep
string to replace
word
exclude string
Keep
Desired out put
Keepword mysecond mythird myfourthKeep
Will there ever be more than one word in a word? If there are more than one, do you want to replace all of them? If not, this should sort you out:
Regex r = new Regex(#"\b((?:(?!Keep|word)\w)*)word((?:(?!Keep)\w)*)\b");
s1 = r.Replace(s0, "$1$2");
to explain:
First, \b((?:(?!Keep|word)\w)*) captures whatever text precedes the first occurrence of word or Keep.
The next thing it sees must be word, If it sees Keep or the end of the string instead, the match attempt immediately fails.
Then ((?:(?!Keep)\w)*)\b captures the remainder of the text in order to ensure it doesn't contain Keep.
When faced with a problem like this, most users' first impulse is to match (in the sense of consuming) only the part of the string they're interested in, using lookarounds to establish the context. It's usually much easier to write the regex so that it always moves forward through the string as it matches. You capture the parts you want to retain so you can plug them back into the result string by means of group references ($1, $2, etc.).
Given that you're using C#, you could use the lookaround approach:
Regex r = new Regex(#"(?<!Keep\w*)word(?!\w*Keep)");
s1 = r.Replace(s0, "");
But please don't. There are very few regex flavors that support unrestricted lookbehinds like .NET does, and most problems don't work so neatly as this one anyway.
string str = "Keepword mywordsecond mythirdword myfourthwordKeep";
str = Regex.Replace(str, "(?<!Keep)word", "");
And I'm going to link you to a one of good Regular Expressions Cheat sheet here
This works in notepad++:
(?<!Keep)word(?!Keep)
It uses "look ahead".
You can use negative look-behind assertion if you want to remove all "word" that are not proceeded by "Keep":
String input = "Keepword mywordsecond mythirdword myfourthwordKeep";
String pattern = "(?<!Keep)word";
String output = Regex.Replace(input, pattern, "");

C# Trouble with Regex.Replace

Been scratching my head all day about this one!
Ok, so I have a string which contains the following:
?\"width=\"1\"height=\"1\"border=\"0\"style=\"display:none;\">');
I want to convert that string to the following:
?\"width=1height=1border=0style=\"display:none;\">');
I could theoretically just do a String.Replace on "\"1\"" etc. But this isn't really a viable option as the string could theoretically have any number within the expression.
I also thought about removing the string "\"", however there are other occurrences of this which I don't want to be replaced.
I have been attempting to use the Regex.Replace method as I believe this exists to solve problems along my lines. Here's what I've got:
chunkContents = Regex.Replace(chunkContents, "\".\"", ".");
Now that really messes things up (It replaces the correct elements, but with a full stop), but I think you can see what I am attempting to do with it. I am also worrying that this will only work for single numbers (\"1\" rather than \"11\").. So that led me into thinking about using the "*" or "+" expression rather than ".", however I foresaw the problem of this picking up all of the text inbetween the desired characters (which are dotted all over the place) whereas I obviously only want to replace the ones with numeric characters in between them.
Hope I've explained that clearly enough, will be happy to provide any extra info if needed :)
Try this
var str = "?\"width=\"1\"height=\"1234\"border=\"0\"style=\"display:none;\">');";
str = Regex.Replace(str , "\"(\\d+)\"", "$1");
(\\d+) is a capturing group that looks for one or more digits and $1 references what the group captured.
This works
String input = #"?\""width=\""1\""height=\""1\""border=\""0\""style=\""display:none;\"">');";
//replace the entire match of the regex with only what's captured (the number)
String result = Regex.Replace(input, #"\\""(\d+)\\""", match => match.Result("$1"));
//control string for excpected result
String shouldBe = #"?\""width=1height=1border=0style=\""display:none;\"">');";
//prints true
Console.WriteLine(result.Equals(shouldBe).ToString());

Remove substring from a list of strings

I have a list of strings that contain banned words. What's an efficient way of checking if a string contains any of the banned words and removing it from the string? At the moment, I have this:
cleaned = String.Join(" ", str.Split().Where(b => !bannedWords.Contains(b,
StringComparer.OrdinalIgnoreCase)).ToArray());
This works fine for single banned words, but not for phrases (e.g. more than one word). Any instance of more than one word should also be removed. An alternative I thought of trying is to use the List's Contains method, but that only returns a bool and not an index of the matching word. If I could get an index of the matching word, I could just use String.Replace(bannedWords[i],"");
A simple String.Replace will not work as it will remove word parts. If "sex" is a banned word and you have the word "sextet", which is not banned, you should keep it as is.
Using Regex you can find whole words and phrases in a text with
string text = "A sextet is a musical composition for six instruments or voices.".
string word = "sex";
var matches = Regex.Matches(text, #"(?<=\b)" + word + #"(?=\b)");
The matches collection will be empty in this case.
You can use the Regex.Replace method
foreach (string word in bannedWords) {
text = Regex.Replace(text, #"(?<=\b)" + word + #"(?=\b)", "")
}
Note: I used the following Regex pattern
(?<=prefix)find(?=suffix)
where 'prefix' and 'suffix' are both \b, which denotes word beginnings and ends.
If your banned words or phrases can contain special characters, it would be safer to escape them with Regex.Escape(word).
Using #zmbq's idea you could create a Regex pattern once with
string pattern =
#"(?<=\b)(" +
String.Join(
"|",
bannedWords
.Select(w => Regex.Escape(w))
.ToArray()) +
#")(?=\b)";
var regex = new Regex(pattern); // Is compiled by default
and then apply it repeatedly to different texts with
string result = regex.Replace(text, "");
It doesn't work because you have conflicting definitions.
When you want to look for sub-sentences like more than one word you cannot split on whitespace anymore. You'll have to fall back on String.IndexOf()
If it's performance you're after, I assume you're not worried about one-time setup time, but rather about continuous performance. So I'd build one huge regular expression containing all the banned expressions and make sure it's compiled - that's as a setup.
Then I'd try to match it against the text, and replace every match with a blank or whatever you want to replace it with.
The reason for this, is that a big regular expression should compile into something comparable to the finite state automaton you would create by hand to handle this problem, so it should run quite nicely.
Why don't you iterate through the list of banned words and look up each of them in the string by using the method string.IndexOf.
For example, you can remove the banned words and phrases with the following piece of code:
myForbWords.ForEach(delegate(string item) {
int occ = str.IndexOf(item);
if(occ > -1) str = str.Remove(occ, item.Length);
});
Type of myForbWords is List<string>.

Remove all "invisible" chars from a string?

I'm writing a little class to read a list of key value pairs from a file and write to a Dictionary<string, string>. This file will have this format:
key1:value1
key2:value2
key3:value3
...
This should be pretty easy to do, but since a user is going to edit this file manually, how should I deal with whitespaces, tabs, extra line jumps and stuff like that? I can probably use Replace to remove whitespaces and tabs, but, is there any other "invisible" characters I'm missing?
Or maybe I can remove all characters that are not alphanumeric, ":" and line jumps (since line jumps are what separate one pair from another), and then remove all extra line jumps. If this, I don't know how to remove "all-except-some" characters.
Of course I can also check for errors like "key1:value1:somethingelse". But stuff like that doesn't really matter much because it's obviously the user's fault and I would just show a "Invalid format" message. I just want to deal with the basic stuff and then put all that in a try/catch block just in case anything else goes wrong.
Note: I do NOT need any whitespaces at all, even inside a key or a value.
I did this one recently when I finally got pissed off at too much undocumented garbage forming bad xml was coming through in a feed. It effectively trims off anything that doesn't fall between a space and the ~ in the ASCII table:
static public string StripControlChars(this string s)
{
return Regex.Replace(s, #"[^\x20-\x7F]", "");
}
Combined with the other RegEx examples already posted it should get you where you want to go.
If you use Regex (Regular Expressions) you can filter out all of that with one function.
string newVariable Regex.Replace(variable, #"\s", "");
That will remove whitespace, invisible chars, \n, and \r.
One of the "white" spaces that regularly bites us is the non-breakable space. Also our system must be compatible with MS-Dynamics which is much more restrictive. First, I created a function that maps the 8th bit characters to their approximate 7th bit counterpart, then I removed anything that was not in the x20 to x7f range further limited by the Dynamics interface.
Regex.Replace(s, #"[^\x20-\x7F]", "")
should do that job.
The requirements are too fuzzy. Consider:
"When is a space a value? key?"
"When is a delimiter a value? key?"
"When is a tab a value? key?"
"Where does a value end when a delimiter is used in the context of a value? key"?
These problems will result in code filled with one off's and a poor user experience. This is why we have language rules/grammar.
Define a simple grammar and take out most of the guesswork.
"{key}":"{value}",
Here you have a key/value pair contained within quotes and separated via a delimiter (,). All extraneous characters can be ignored. You could use use XML, but this may scare off less techy users.
Note, the quotes are arbitrary. Feel free to replace with any set container that will not need much escaping (just beware the complexity).
Personally, I would wrap this up in a simple UI and serialize the data out as XML. There are times not to do this, but you have given me no reason not to.
var split = textLine.Split(":").Select(s => s.Trim()).ToArray();
The Trim() function will remove all the irrelevant whitespace. Note that this retains whitespace inside of a key or value, which you may want to consider separately.
You can use string.Trim() to remove white-space characters:
var results = lines
.Select(line => {
var pair = line.Split(new[] {':'}, 2);
return new {
Key = pair[0].Trim(),
Value = pair[1].Trim(),
};
}).ToList();
However, if you want to remove all white-spaces, you can use regular expressions:
var whiteSpaceRegex = new Regex(#"\s+", RegexOptions.Compiled);
var results = lines
.Select(line => {
var pair = line.Split(new[] {':'}, 2);
return new {
Key = whiteSpaceRegex.Replace(pair[0], string.Empty),
Value = whiteSpaceRegex.Replace(pair[1], string.Empty),
};
}).ToList();
If it doesn't have to be fast, you could use LINQ:
string clean = new String(tainted.Where(c => 0 <= "ABCDabcd1234:\r\n".IndexOf(c)).ToArray());

Regex Word splitting in C#

I know similar questions have been asked before, but I can't find one that is like mine, or enough like mine to help me out :). So essentially I want to split up a string which contains a bunch of words, and I don't want to return any characters that are not words (this is the key problem I am struggling with, ignoring characters). This is how I define the problem:
What constitutes a word is a string of any character a-zA-Z only
(no numbers or anything else)
In between any word, there can be any number of random other characters
I want to get back a string[] containing only the words
eg: text: "apple^&**^orange1247pear"
I want to return: apple, orange, pear in an array.
The closest I have found I suppose is this:
Regex.Split("apple^orange7pear",#"([a-zA-Z]*)")
Which splits out the apple/orange/pear, but also returns a bunch of other junk and blank strings.
Anyone know how to stop the split function from returning certain parts of the string, or is that not possible?
Thanks in advance for any help you give me :)
Split should match the tokens between your words. In your regex you've added a group around the word, so it is included in the result, but that isn't desired in this case. Note that this regex matches anything besides valid words - anything that isn't an ASCII letter:
string[] words = Regex.Split(str, "[^a-zA-Z]+");
Another option is to match the words directly:
MatchCollection matches = Regex.Matches(str, "[a-zA-Z]+");
string[] words2 = matches.Cast<Match>().Select(m => m.Value).ToArray();
The second option is probably clearer, and will not include blank elements on the start or end of the array.
var splits = Regex.Split("aaa $$$bbb ccc", #"[^A-Za-z]+");
But to include non-latin letters, I would use this:
var splits = Regex.Split("aaa $$$bbb ccc", #"\P{L}+");
Try this:
Regex.Matches("kalle kula(/()&//()nisse8978971", #"[A-Za-z]+")
Using Matches() will collect only the words, Split() will divide the string which is not what you want.
The second option Kobi listed is better and easier to control. I use the following regular expression to locate common entities such as words, numbers, email addresses in a string it will.
var regex = new Regex(#"[\p{L}\p{N}\p{M}]+(?:[-.'ยด_#][\p{L}|\p{N}|\p{M}]+)*", RegexOptions.Compiled);

Categories