Regex that excludes all except some words - C#

I thought that filtering a string like:
"Hello <strong>plip</strong> plop"
to obtain
"plip plop", that is, excluding all words except 'plip' and 'plop' would be easy with this C# line:
new Regex("[^(plip)(plop)]").Replace(inputString,"").
Unfortunately, the negated brackets [^] do not seem to accept whole words to exclude: the result keeps every letter contained in 'plip' and 'plop' (the result is "llooplipoplop").
Is there a way to achieve this in a single regex/line, or is it necessary to loop over all matches of plip and plop, then concatenate them?

Generally speaking, it is much easier to write a regex that matches what you do want than one that matches all the stuff you don't want.
In this case you want to "exclude all words except plip and plop", but why not just include only plip and plop instead?
var input = "Hello <strong>plip</strong> plop";
var matches = Regex.Matches(input, "plip|plop");
var result = string.Join("", matches.Cast<Match>().Select(x => x.Value));
Console.Out.WriteLine(result); // prints "plipplop"
Of course since you asked for a one-liner, you could do everything without the temp variables (and good luck to the next guy reading the code!):
var result = string.Join("", Regex.Matches("Hello <strong>plip</strong> plop", "plip|plop").Cast<Match>().Select(x => x.Value));
Also, assuming your actual word list is more complicated than plip and plop, you can do something like var pattern = string.Join("|", words); to construct the pattern.
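A minimal sketch of that last suggestion, assuming the kept words should be rejoined with single spaces (which gives the "plip plop" the question asked for). The names WordFilter and KeepOnly are illustrative and not from the question. Each word goes through Regex.Escape so that any regex metacharacters in it are matched literally.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordFilter
{
    // Keeps only the listed words from the input, in order of appearance,
    // and joins them with a single space.
    public static string KeepOnly(string input, params string[] words)
    {
        // Escape each word, then build an alternation such as "plip|plop".
        var pattern = string.Join("|", words.Select(w => Regex.Escape(w)));
        var matches = Regex.Matches(input, pattern);
        return string.Join(" ", matches.Cast<Match>().Select(m => m.Value));
    }
}
```

Calling WordFilter.KeepOnly("Hello <strong>plip</strong> plop", "plip", "plop") returns "plip plop". The sketch assumes a non-empty word list; an empty list would produce an empty pattern, which matches at every position.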

hope this works
(?<=(\bplip\b|\bplop\b|^)).*?(?=(\bplip\b|\bplop\b|$))
You should set single-line mode (RegexOptions.Singleline) for the above regex to work
works here

Related

How to structure REGEX in C#

I currently have a regex that checks if a US State is spelled correctly
var r = new Regex(string.Format(@"\b(?:{0})\b", pattern), RegexOptions.IgnoreCase);
pattern is a pipe delimited string containing all US states.
It was working as intended today until one of the states was spelled like "Florida..". I would have liked it to pick up on the full stop characters.
I found this regex that will only match letters.
^[a-zA-Z]+
How do I combine this with my current Regex or is it not possible?
I tried some variations of this but it didn't work
var r = new Regex(string.Format(@"\b^[a-zA-Z]+(?:{0})\b", pattern), RegexOptions.IgnoreCase);
EDIT: Florida.. was in my input string. My pattern string hasn't changed at all. Apologies for not being clearer.
It seems you need start of string (^) and end of string ($) anchors:
var r = new Regex(string.Format(@"^(?:{0})$", pattern), RegexOptions.IgnoreCase);
The regex above would match any string comprising a name of a state only.
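As a rough sketch of how the anchored pattern behaves, assuming a shortened, hypothetical list of three states in place of the full pipe-delimited pattern (StateValidator and IsState are illustrative names):

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class StateValidator
{
    // Hypothetical, shortened list; the real pattern would contain every state.
    private static readonly string[] States = { "Colorado", "Florida", "New Mexico" };

    // ^ and $ anchor the alternation to the start and end of the input,
    // so extra characters such as ".." make the match fail.
    private static readonly Regex Exact = new Regex(
        "^(?:" + string.Join("|", States.Select(s => Regex.Escape(s))) + ")$",
        RegexOptions.IgnoreCase);

    public static bool IsState(string input)
    {
        return Exact.IsMatch(input);
    }
}
```

With this, IsState("florida") is true and IsState("Florida..") is false. Note that $ also matches just before a trailing newline; \z can be used instead if that matters.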
You should make a replacement in the pattern variable to escape the regex special characters. One of them is the . character. Something similar to pattern.Replace(".", @"\.") but covering all the special characters.
I believe you can't merge both patterns into one, so you would have to perform two different regex operations, one to split the states into a list, and a subsequent one for the validation of each item within it.
I'd rather go for something "simpler" such as
var states = input.Split('|').Select(s => new string(s.Where(char.IsLetter).ToArray()))
.Where(s => !string.IsNullOrWhiteSpace(s));
Basically don't use a regex here.
List<string> values = new List<string>() { "florida" /* , etc. */ };
string input;
//is input in values, ignore case and look for any value that includes the input value
bool correct = values.Any(a =>
input.IndexOf(a, StringComparison.CurrentCultureIgnoreCase) >= 0);
This will be considerably more efficient than a regex based option. This should match florida, Florida and Florida..., etc.
Don't search for the wanted characters directly; tell the regex to consume everything that is not one of the specific characters you want to exclude, such as [^\|.]+. The set [ ] with the negation indicator ^ says: consume anything that is not a literal | or a literal dot. Hence it consumes just the text needed. For example, running it on
Colorado|Florida..|New Mexico
returns three matches: Colorado, Florida and New Mexico

C# Trouble with Regex.Replace

Been scratching my head all day about this one!
Ok, so I have a string which contains the following:
?\"width=\"1\"height=\"1\"border=\"0\"style=\"display:none;\">');
I want to convert that string to the following:
?\"width=1height=1border=0style=\"display:none;\">');
I could theoretically just do a String.Replace on "\"1\"" etc. But this isn't really a viable option as the string could theoretically have any number within the expression.
I also thought about removing the string "\"", however there are other occurrences of this which I don't want to be replaced.
I have been attempting to use the Regex.Replace method as I believe this exists to solve problems along my lines. Here's what I've got:
chunkContents = Regex.Replace(chunkContents, "\".\"", ".");
Now that really messes things up (it replaces the correct elements, but with a full stop), but I think you can see what I am attempting to do with it. I am also worried that this will only work for single digits (\"1\" rather than \"11\"). That led me to think about using the "*" or "+" expression rather than ".", however I foresaw the problem of this picking up all of the text in between the desired characters (which are dotted all over the place), whereas I obviously only want to replace the ones with numeric characters in between them.
Hope I've explained that clearly enough, will be happy to provide any extra info if needed :)
Try this
var str = "?\"width=\"1\"height=\"1234\"border=\"0\"style=\"display:none;\">');";
str = Regex.Replace(str , "\"(\\d+)\"", "$1");
(\\d+) is a capturing group that looks for one or more digits and $1 references what the group captured.
This works
String input = @"?\""width=\""1\""height=\""1\""border=\""0\""style=\""display:none;\"">');";
//replace the entire match of the regex with only what's captured (the number)
String result = Regex.Replace(input, @"\\""(\d+)\\""", match => match.Result("$1"));
//control string for expected result
String shouldBe = @"?\""width=1height=1border=0style=\""display:none;\"">');";
//prints true
Console.WriteLine(result.Equals(shouldBe).ToString());

Parsing a file or directory from a semi-random text

I've got a method that is going to perform some SVN commands (using SharpSVN) on a collection of files and/or directories, based on what the user has selected within a textbox on the form.
Quickly storing some highlighted text in a variable and looking at it, sample data might be like this:
Modified -- C:\\\folder\\\trunk\\\SubFolderOne\\\SubFolderTwo\\\SubThree\r\nModified --
C:\\\folder\\\trunk\\\SubFolderOne\\\SubFolderTwo\\\SubThree\\\myFile.cs
I'm trying to write a regex to parse out anything in between a space and the \r character, but I can't figure it out.
I thought the pattern would be something like this:
@"\s\S*\\r"
But using my sample data here it yields this as a result:
C:\\\folder\\\trunk\\\SubFolderOne\\\SubFolderTwo\\\SubThree\r
Then I'm just going to throw each result (ie proper path/file) into a collection of strings which will be used elsewhere in the application.
Is there a better way to do this using the Path class, hopefully?
One thing I can think of would be to split up the data using substring any time it finds \r\n, then simply drop the "prefix" (Modified --, NotVersioned --, Normal --) from the strings.
That seems really... poor though.
If it helps, I do know that the top-most directory will always be C:\\folder\\trunk
I would recommend that you split the string on "\r\n" and then match each string. For example:
Regex re = new Regex(@"\s(\S*?)$");
foreach (var line in s.Split(new[]{"\r\n"}, StringSplitOptions.RemoveEmptyEntries))
{
    // Match against the current line, not the whole input string
    Match m = re.Match(line);
    Console.WriteLine("{0},{1},'{2}'", m.Index, m.Length, m.Groups[1].Value);
}
That works when tested against your sample text.
You can use regex lookahead and lookbehind
String pattern = @"(?<=--\s)\S*(?=\\r|$)";
var result = Regex.Matches(input, pattern);
foreach (Match match in result)
{
Console.WriteLine(match.Value);
}
Parsing invalid values is not a job for the Path class. You should use either a regex or split and substring. Both ways are good; prefer the one you can easily read, explain and change.
var paths =
Regex.Split(input, @"\\r\\n")
.Select(row => row.Substring(row.LastIndexOf(' ') + 1, row.Length - row.LastIndexOf(' ') - 1));
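The split-and-drop-the-prefix idea can also be sketched without a regex. This version assumes the text contains real line breaks (not the literal characters \r\n) and that every status prefix is separated from the path by " -- ", as in the sample data. StatusParser and ExtractPaths are illustrative names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StatusParser
{
    // Extracts the path from lines shaped like "Modified -- C:\folder\trunk\file.cs".
    public static List<string> ExtractPaths(string text)
    {
        const string separator = " -- ";
        return text
            .Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries)
            .Select(line => new { line, index = line.IndexOf(separator, StringComparison.Ordinal) })
            .Where(x => x.index >= 0) // skip lines that have no status prefix
            .Select(x => x.line.Substring(x.index + separator.Length).Trim())
            .ToList();
    }
}
```

Because it cuts at the first " -- " instead of the last space, this keeps paths that contain spaces intact, which the \S-based patterns would truncate.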

Remove substring from a list of strings

I have a list of strings that contain banned words. What's an efficient way of checking if a string contains any of the banned words and removing it from the string? At the moment, I have this:
cleaned = String.Join(" ", str.Split().Where(b => !bannedWords.Contains(b,
StringComparer.OrdinalIgnoreCase)).ToArray());
This works fine for single banned words, but not for phrases (e.g. more than one word). Any instance of more than one word should also be removed. An alternative I thought of trying is to use the List's Contains method, but that only returns a bool and not an index of the matching word. If I could get an index of the matching word, I could just use String.Replace(bannedWords[i],"");
A simple String.Replace will not work as it will remove word parts. If "sex" is a banned word and you have the word "sextet", which is not banned, you should keep it as is.
Using Regex you can find whole words and phrases in a text with
string text = "A sextet is a musical composition for six instruments or voices.";
string word = "sex";
var matches = Regex.Matches(text, @"(?<=\b)" + word + @"(?=\b)");
The matches collection will be empty in this case.
You can use the Regex.Replace method
foreach (string word in bannedWords) {
    text = Regex.Replace(text, @"(?<=\b)" + word + @"(?=\b)", "");
}
Note: I used the following Regex pattern
(?<=prefix)find(?=suffix)
where 'prefix' and 'suffix' are both \b, which denotes word beginnings and ends.
If your banned words or phrases can contain special characters, it would be safer to escape them with Regex.Escape(word).
Using @zmbq's idea you could create a Regex pattern once with
string pattern =
    @"(?<=\b)(" +
    String.Join(
        "|",
        bannedWords
            .Select(w => Regex.Escape(w))
            .ToArray()) +
    @")(?=\b)";
var regex = new Regex(pattern); // Parsed once here; add RegexOptions.Compiled to compile it to IL
and then apply it repeatedly to different texts with
string result = regex.Replace(text, "");
It doesn't work because you have conflicting definitions.
Once you want to look for phrases of more than one word, you can no longer split the input on whitespace. You'll have to fall back on String.IndexOf().
If it's performance you're after, I assume you're not worried about one-time setup time, but rather about continuous performance. So I'd build one huge regular expression containing all the banned expressions and make sure it's compiled - that's as a setup.
Then I'd try to match it against the text, and replace every match with a blank or whatever you want to replace it with.
The reason for this, is that a big regular expression should compile into something comparable to the finite state automaton you would create by hand to handle this problem, so it should run quite nicely.
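A minimal sketch of that setup, under a few assumptions that are not in the answer: matching is case-insensitive, banned entries start and end with word characters (so \b behaves as expected), and the gaps left by removed words are collapsed to one space. BannedWordFilter and Clean are illustrative names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public sealed class BannedWordFilter
{
    private readonly Regex _regex;

    public BannedWordFilter(IEnumerable<string> bannedWords)
    {
        // Longer entries first, so a phrase wins over a shorter word it starts with.
        var alternatives = bannedWords
            .OrderByDescending(w => w.Length)
            .Select(w => Regex.Escape(w));

        // One-time setup cost: build and compile a single pattern.
        _regex = new Regex(
            @"\b(?:" + string.Join("|", alternatives) + @")\b",
            RegexOptions.Compiled | RegexOptions.IgnoreCase);
    }

    public string Clean(string text)
    {
        var stripped = _regex.Replace(text, "");
        // Collapse the gaps left behind by removed words.
        return Regex.Replace(stripped, @"\s{2,}", " ").Trim();
    }
}
```

For example, a filter built from { "sex", "bad phrase" } turns "A sextet is not a bad phrase or sex here" into "A sextet is not a or here": the phrase and the standalone word are removed, while "sextet" is left alone.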
Why don't you iterate through the list of banned words and look up each of them in the string by using the method string.IndexOf?
For example, you can remove the banned words and phrases with the following piece of code:
myForbWords.ForEach(delegate(string item) {
int occ = str.IndexOf(item);
if(occ > -1) str = str.Remove(occ, item.Length);
});
Type of myForbWords is List<string>.

Remove all "invisible" chars from a string?

I'm writing a little class to read a list of key value pairs from a file and write to a Dictionary<string, string>. This file will have this format:
key1:value1
key2:value2
key3:value3
...
This should be pretty easy to do, but since a user is going to edit this file manually, how should I deal with whitespace, tabs, extra line jumps and stuff like that? I can probably use Replace to remove whitespace and tabs, but are there any other "invisible" characters I'm missing?
Or maybe I can remove all characters that are not alphanumeric, ":" and line jumps (since line jumps are what separate one pair from another), and then remove all extra line jumps. If so, I don't know how to remove "all-except-some" characters.
Of course I can also check for errors like "key1:value1:somethingelse". But stuff like that doesn't really matter much because it's obviously the user's fault and I would just show a "Invalid format" message. I just want to deal with the basic stuff and then put all that in a try/catch block just in case anything else goes wrong.
Note: I do NOT need any whitespaces at all, even inside a key or a value.
I did this one recently when I finally got pissed off at too much undocumented garbage forming bad xml was coming through in a feed. It effectively trims off anything that doesn't fall between a space and the ~ in the ASCII table:
public static string StripControlChars(this string s)
{
    // Keep only space (0x20) through ~ (0x7E); 0x7F is the DEL control character
    return Regex.Replace(s, @"[^\x20-\x7E]", "");
}
Combined with the other RegEx examples already posted it should get you where you want to go.
If you use Regex (Regular Expressions) you can filter out all of that with one function.
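string newVariable = Regex.Replace(variable, @"\s", "");
That will remove whitespace characters such as spaces, tabs, \n, and \r. Other invisible characters (control characters, for example) are not matched by \s.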
One of the "white" spaces that regularly bites us is the non-breakable space. Also our system must be compatible with MS-Dynamics which is much more restrictive. First, I created a function that maps the 8th bit characters to their approximate 7th bit counterpart, then I removed anything that was not in the x20 to x7f range further limited by the Dynamics interface.
Regex.Replace(s, @"[^\x20-\x7F]", "")
should do that job.
The requirements are too fuzzy. Consider:
"When is a space a value? key?"
"When is a delimiter a value? key?"
"When is a tab a value? key?"
"Where does a value end when a delimiter is used in the context of a value? key"?
These problems will result in code filled with one-offs and a poor user experience. This is why we have language rules/grammar.
Define a simple grammar and take out most of the guesswork.
"{key}":"{value}",
Here you have a key/value pair contained within quotes and separated via a delimiter (,). All extraneous characters can be ignored. You could use XML, but this may scare off less techy users.
Note, the quotes are arbitrary. Feel free to replace with any set container that will not need much escaping (just beware the complexity).
Personally, I would wrap this up in a simple UI and serialize the data out as XML. There are times not to do this, but you have given me no reason not to.
var split = textLine.Split(':').Select(s => s.Trim()).ToArray();
The Trim() function will remove all the irrelevant whitespace. Note that this retains whitespace inside of a key or value, which you may want to consider separately.
You can use string.Trim() to remove white-space characters:
var results = lines
.Select(line => {
var pair = line.Split(new[] {':'}, 2);
return new {
Key = pair[0].Trim(),
Value = pair[1].Trim(),
};
}).ToList();
However, if you want to remove all white-spaces, you can use regular expressions:
var whiteSpaceRegex = new Regex(@"\s+", RegexOptions.Compiled);
var results = lines
.Select(line => {
var pair = line.Split(new[] {':'}, 2);
return new {
Key = whiteSpaceRegex.Replace(pair[0], string.Empty),
Value = whiteSpaceRegex.Replace(pair[1], string.Empty),
};
}).ToList();
If it doesn't have to be fast, you could use LINQ:
string clean = new String(tainted.Where(c => 0 <= "ABCDabcd1234:\r\n".IndexOf(c)).ToArray());
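Putting the pieces from the answers above together, here is one possible sketch of the whole reader. It assumes that all whitespace may be dropped (as the question states), that blank lines are ignored, and that anything other than exactly one key and one value per line is reported as an error. KeyValueParser and Parse are illustrative names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class KeyValueParser
{
    private static readonly Regex Whitespace = new Regex(@"\s+");

    // Parses "key:value" lines; blank lines are skipped, malformed ones throw.
    public static Dictionary<string, string> Parse(string text)
    {
        var result = new Dictionary<string, string>();
        foreach (var rawLine in text.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries))
        {
            // Remove every whitespace character, including ones inside keys and values.
            var line = Whitespace.Replace(rawLine, "");
            if (line.Length == 0) continue;

            var parts = line.Split(':');
            if (parts.Length != 2 || parts[0].Length == 0 || parts[1].Length == 0)
                throw new FormatException("Invalid format: " + rawLine);

            // A repeated key overwrites the earlier value.
            result[parts[0]] = parts[1];
        }
        return result;
    }
}
```

The text is split into lines before whitespace is removed, so the line breaks that separate the pairs are not lost. A line such as "key1:value1:somethingelse" raises a FormatException, which the caller can turn into the "Invalid format" message.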
