How to negate filename after a specific term in a regex - c#

I have a regex that detects URLs:
@"((http|ftp|https)\:\/\/)?([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?";
I am using it with Regex.Replace to remove URLs from text.
I do not want it to replace any word that starts with /images.
For example, if the text is "this is my text here is a link http://dfdf.com and my is /images/dd.gif",
I need http://dfdf.com to be replaced, but not /images/dd.gif.
My regex replaces the dd.gif part,
so I want to exclude any word after images/.
Any idea how I can fix this?

You may start matching after a word boundary, and fail the match if it is immediately preceded by the whole "word" images/ using
\b(?<!\bimages/)(?:(?:http|ftp)s?://)?([\w-]+(?:\.[\w-]+)+)([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?
Details:
\b - a word boundary
(?<!\bimages/) - no images/ as a whole word is allowed immediately on the left
(?:(?:http|ftp)s?://)? - an optional sequence of either http or ftp followed with an optional s and then :// substring
([\w-]+(?:\.[\w-]+)+) - Group 1: one or more word or hyphen chars followed with one or more sequences of a . and then one or more word or hyphen chars
([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])? - an optional Group 2: zero or more word chars or chars from the .,@?^=%&:/~+#- set and then a word char or a char from the @?^=%&/~+#- set.
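To see the lookbehind in action, here is a minimal sketch that applies the pattern above to the sample text from the question. The pattern is written with @ in the character classes, and the UrlStripper class and RemoveUrls method are hypothetical names used only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlStripper
{
    // The lookbehind pattern from the answer above, with '@' in the character classes.
    private static readonly Regex UrlPattern = new Regex(
        @"\b(?<!\bimages/)(?:(?:http|ftp)s?://)?([\w-]+(?:\.[\w-]+)+)([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?");

    // Removes every URL-like match; a file name directly after "images/" is left alone.
    public static string RemoveUrls(string text) => UrlPattern.Replace(text, "");

    public static void Main()
    {
        var s = "this is my text here is a link http://dfdf.com and my is /images/dd.gif";
        Console.WriteLine(RemoveUrls(s));
    }
}
```

Only the URL itself is removed, so the spaces on both sides of it stay in the result. The lookbehind only inspects the text immediately to the left of the match, so a file in a deeper folder such as /images/sub/dd.gif is not protected by this pattern.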

As an alternative solution, you could match what you don't want to remove and capture what you do want to remove.
You can use a callback with Replace and test for the existence of group 1. If it is there, return an empty string. If it is not there, return the match to leave it unchanged.
\S*/images\S*|(?<!\S)((?:(?:https?|ftp)://)?[\w-]+(?:(?:\.[\w-]+)+)(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?)
Explanation
\S*/images\S* Match /images preceded and followed by optional non-whitespace chars that you want to keep
| Or
(?<!\S) Assert a whitespace boundary to the left
((?:(?:https?|ftp)://)?[\w-]+(?:(?:\.[\w-]+)+)(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?) The pattern that you tried with some minor changes to make it a bit shorter
For example
var s = @"this is my text here is a link http://dfdf.com and my is /images/dd.gif";
var regex = new Regex(@"\S*/images\S*|(?<!\S)((?:(?:https?|ftp)://)?[\w-]+(?:(?:\.[\w-]+)+)(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?)");
var result = regex.Replace(s, match => match.Groups[1].Success ? "" : match.Value);
Console.WriteLine(result);

Related

Alternate regex with -SDR?

I have the following regex in my c#:
(?<!\w)M20A\w+
Actual code:
string regex = $@"(?<!\w){prefix}\w+";
Notice the prefix var matches strings such as M20A and X50G.
It perfectly matches the following cases:
M20A0820
M20A1234
M20A7U8V
But now I got a new requirement from the business to match, for example:
M20A-SDR
It will be the prefix followed by the exact string "-SDR". Not just a dash followed by 3 alphanumerics, but literally "-SDR". The existing matches need to still work, but prefix + "-SDR" must also be matched.
What would be the regex that would match the following:
M20A0820
M20A1234
M20A7U8V
M20A-SDR
You may use
string regex = $@"(?<!\w){prefix}\w*(?:-SDR)?";
Or, to match as a whole word, you may use word boundaries:
string regex = $@"\b{prefix}\w*(?:-SDR)?\b";
The \b word boundary at the start will work if all the values in prefix start with a word char (a letter, digit or _). The word boundary at the end makes sense if no more word chars can follow -SDR.
The (?:-SDR)? will match a -SDR string optionally.
Details
\b - word boundary
M20A - a literal string
\w* - 0+ word chars
(?:-SDR)? - a non-capturing group that matches 1 or 0 times (as there is a ? after it) an -SDR substring
\b - a word boundary.
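Note that \w* can match nothing and (?:-SDR)? is optional, so the pattern above also accepts the bare prefix (M20A) and a code followed by -SDR (M20A0820-SDR). If only "prefix plus word chars" or "prefix plus the literal -SDR" should match, an alternation is one possible tightening. The sketch below assumes that stricter reading of the requirement; PartNumberMatcher and Build are hypothetical names, and Regex.Escape is added so a prefix containing regex metacharacters is treated literally.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PartNumberMatcher
{
    // Stricter variant (an assumption about the requirement, not the answer's pattern):
    // the prefix must be followed either by 1+ word chars or by the literal "-SDR".
    public static Regex Build(string prefix) =>
        new Regex($@"\b{Regex.Escape(prefix)}(?:\w+|-SDR)\b");

    public static void Main()
    {
        var regex = Build("M20A");
        foreach (var sample in new[] { "M20A0820", "M20A1234", "M20A7U8V", "M20A-SDR", "M20A", "M20A-SDX" })
        {
            // Prints whether each sample contains a match.
            Console.WriteLine($"{sample}: {regex.IsMatch(sample)}");
        }
    }
}
```

With this variant the four codes from the question match, while the bare prefix M20A and M20A-SDX do not.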

Remove everything that doesn't match

string line = "Rok rok irrelevant text irrelevant;text.irrelevant,text";
string NewLine = Regex.Replace(line, @"\b[rR]\w*", "");
Right now it replaces every word starting with r/R with a blank space, but I want to make everything a blank space EXCEPT words starting with r/R.
Edit
It seems all you want is to extract words starting with r or R and join them with a space. In this case, use a mere \b[rR]\w* regex and the following code:
var result = string.Join(" ", Regex.Matches(line, @"\b[rR]\w*").Cast<Match>().Select(x => x.Value));
Original answer
You may use a negative lookahead after a word boundary:
\b(?![rR])\w+
  ^^^^^^^^
Note that the + quantifier is better here since you want to remove at least 1 char found.
Or, in case you also want to remove all non-word chars after the found word, use
\b(?![rR])\w+\W*
If you want to remove any non-word chars before and after a qualifying word, use
var result = Regex.Replace(line, @"\W*\b(?![rR])\w+\W*", " ").Trim();
It will remove all non-word chars before a word not starting with r and R and after it.
Details
\b - a word boundary
(?![rR]) - a negative lookahead that will fail the match if, immediately to the right of the current location, there is r or R
\w+ - 1+ word chars
\W* - 0+ non-word chars.
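The two approaches above (extracting the wanted words versus removing the unwanted ones) can be compared side by side on the sample line from the question. This is a minimal sketch; the WordFilter class and its method names are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordFilter
{
    // Extraction approach: collect the words starting with r/R and join them with a space.
    public static string ExtractRWords(string line) =>
        string.Join(" ", Regex.Matches(line, @"\b[rR]\w*").Cast<Match>().Select(m => m.Value));

    // Removal approach: replace every word NOT starting with r/R, together with the
    // non-word chars around it, with a single space, then trim the ends.
    public static string RemoveOtherWords(string line) =>
        Regex.Replace(line, @"\W*\b(?![rR])\w+\W*", " ").Trim();

    public static void Main()
    {
        string line = "Rok rok irrelevant text irrelevant;text.irrelevant,text";
        Console.WriteLine(ExtractRWords(line));
        Console.WriteLine(RemoveOtherWords(line));
    }
}
```

The removal approach inserts one space per removed word, so a kept word that follows several removed words ends up separated by several spaces; the extraction approach always yields exactly one space between kept words.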

C# equivalent for this regex pattern

I have this regular expression pattern: .{2}\@.{2}\K|\..*(*SKIP)(?!)|.(?=.*\.)
It works perfectly for replacing the matches to get
trabc@abtrec.com.lo => ***bc@ab*****.com.lo
demomail@demodomain.com => ******il@de*********.com
But when I try to use it in C#, the \K, (*SKIP) and (*F) constructs are not allowed.
What would be the C# version of this pattern? Or do you know a simpler way to mask the email without the unsupported constructs?
UPDATE:
(*SKIP): this verb causes the match to fail at the current starting position in the subject if the rest of the pattern does not match
(*F): Forces a matching failure at the given position in the pattern (the same as (?!))
Try this regex:
\w(?=.{2,}@)|(?<=@[^\.]{2,})\w
Explanation:
\w - matches a word character
(?=.{2,}@) - positive lookahead to find the position immediately followed by 2+ occurrences of any character followed by @
| - OR
(?<=@[^\.]{2,}) - positive lookbehind to find the position immediately preceded by @ followed by 2+ occurrences of any character that is not a .
\w - matches a word character.
Replace each match with a *
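As a minimal sketch of that replacement in C#, assuming the pattern is written with @ as the separator between user name and domain (EmailMasker and Mask are hypothetical names):

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmailMasker
{
    // First alternative: a word char with at least two more chars before the '@'.
    // Second alternative: a word char preceded by '@' plus 2+ non-dot chars,
    // which keeps the first two domain chars and everything from the first dot on.
    private static readonly Regex MaskPattern = new Regex(@"\w(?=.{2,}@)|(?<=@[^\.]{2,})\w");

    public static string Mask(string email) => MaskPattern.Replace(email, "*");

    public static void Main()
    {
        Console.WriteLine(Mask("trabc@abtrec.com.lo"));
        Console.WriteLine(Mask("demomail@demodomain.com"));
    }
}
```

This pattern replaces each masked character with exactly one *, so the masked domain part comes out one star shorter than in the question's expected output, where the empty match produced by \K contributes an extra *.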
You can achieve the same result with a regex that matches items in one block, and applying a custom match evaluator:
var res = Regex.Replace(
s
, @"^.*(?=.{2}\@.{2})|(?<=.{2}\@.{2}).*(?=.com.*$)"
, match => new string('*', match.ToString().Length)
);
The regex has two parts:
The one on the left, ^.*(?=.{2}\@.{2}), matches the user name portion except the last two characters
The one on the right, (?<=.{2}\@.{2}).*(?=.com.*$), matches the rest of the domain after its first two characters, up to the ".com..." ending.

Regular expression matching c# without some tags

I want to match a word exactly and as a prefix (wildcard match), but there's one condition: it should not be surrounded by a particular tag.
For example, if the word to match is test, the regular expression should match
test, testing, tester, testing.aspx, but it should not match test</x>, testing</x>, tester</x> and other words with the prefix test.
I came up with a regex, but it matches test</x> too.
string regex = string.Format("\\b{0}(\\S)*(?!</x>)", "test");
Can somebody help me in correcting my regex?
The \btest(\S)*(?!</x>) pattern matches test</x> because \btest finds a word starting with test, then (\S)* matches and repeatedly captures any 0+ non-whitespace chars, and only then does the lookahead check that there is no </x> immediately to the right of the current location. Since (\S)* has already consumed the whole </x>, the negative lookahead runs when the regex index is already past this </x>, so it succeeds and the match is returned.
You may use
string regex = string.Format(@"(?>\b{0}[^<\s]*)(?!</x>)", "test");
// or, beginning with C#6
// var regex = $@"(?>\b{SearchWord}[^<\s]*)(?!</x>)";
Now, it will match like this:
(?>\btest[^<\s]*) - an atomic group matching
\b - a word boundary
test - search term
[^<\s]* - 0+ chars other than < and whitespace
(?!</x>) - a negative lookahead that fails the match if there is a </x> char sequence immediately to the right of the current location
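A minimal sketch of the atomic-group pattern applied to a line that mixes plain and tagged words follows. TagAwareSearch and FindMatches are hypothetical names, and Regex.Escape is an addition so that a search word containing regex metacharacters is matched literally.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TagAwareSearch
{
    // Returns every word starting with searchWord that is not immediately followed by </x>.
    public static List<string> FindMatches(string text, string searchWord)
    {
        // The atomic group (?>...) forbids backtracking into [^<\s]*, so the lookahead
        // is always evaluated right after the complete word.
        var pattern = $@"(?>\b{Regex.Escape(searchWord)}[^<\s]*)(?!</x>)";
        return Regex.Matches(text, pattern).Cast<Match>().Select(m => m.Value).ToList();
    }

    public static void Main()
    {
        var text = "test testing tester testing.aspx test</x> testing</x>";
        foreach (var word in FindMatches(text, "test"))
        {
            Console.WriteLine(word);
        }
    }
}
```

Without the atomic group, the engine could give back characters from [^<\s]* one at a time until the lookahead passes, which would return a shortened match such as testin for testing</x>.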

How to insert spaces between characters using Regex?

Trying to learn a little more about using Regex (Regular expressions). Using Microsoft's version of Regex in C# (VS 2010), how could I take a simple string like:
"Hello"
and change it to
"H e l l o"
This could be a string of any letter or symbol, capitals, lowercase, etc., and there are no other letters or symbols following or leading this word. (The string consists of only the one word).
(I have read the other posts, but I can't seem to grasp Regex. Please be kind :) ).
Thanks for any help with this. (an explanation would be most useful).
You could do this through regex only, no need for inbuilt c# functions.
Use the regex below and then replace the matched boundaries with a space.
(?<=.)(?!$)
string result = Regex.Replace(yourString, @"(?<=.)(?!$)", " ");
Explanation:
(?<=.) Positive lookbehind asserts that the match must be preceded by a character.
(?!$) Negative lookahead which asserts that the match won't be followed by an end of the line anchor. So the boundaries next to all the characters would be matched but not the one which was next to the last character.
OR
You could also use word boundaries.
(?<!^)(\B|\b)(?!$)
string result = Regex.Replace(yourString, @"(?<!^)(\B|\b)(?!$)", " ");
Explanation:
(?<!^) Negative lookbehind which asserts that the match won't be at the start.
(\B|\b) Matches the boundary which exists between two word characters and two non-word characters (\B) or match the boundary which exists between a word character and a non-word character (\b).
(?!$) Negative lookahead asserts that the match won't be followed by an end of the line anchor.
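Both lookaround patterns match the same zero-width positions: every gap between two characters, but neither the start nor the end of the string. A minimal sketch that runs them side by side (CharSpacer and its method names are hypothetical):

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharSpacer
{
    // Matches every position that has a character before it and is not the end of the string.
    public static string WithLookbehind(string s) =>
        Regex.Replace(s, @"(?<=.)(?!$)", " ");

    // Matches every boundary or non-boundary position except the start and the end.
    public static string WithBoundaries(string s) =>
        Regex.Replace(s, @"(?<!^)(\B|\b)(?!$)", " ");

    public static void Main()
    {
        Console.WriteLine(WithLookbehind("Hello"));
        Console.WriteLine(WithBoundaries("Hello"));
    }
}
```

Because the matches are zero-width, nothing is removed from the input; the replacement only inserts a space at each matched position.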
Regex.Replace("Hello", "(.)", "$1 ").TrimEnd();
Explanation
The dot character class matches every character of your string "Hello".
The parentheses around the dot character are required so that we can refer to the captured character through the $n notation.
Each captured character is replaced by the replacement string. Our replacement string is "$1 " (notice the space at the end). Here $1 represents the first captured group in the input, therefore our replacement string will replace each character by that character plus one space.
This technique will add one space after the final character "o" as well, so we call TrimEnd() to remove that.
For the enthusiast, the same effect can be achieved through LINQ using this one-liner:
String.Join(" ", YourString.AsEnumerable())
or if you don't want to use the extension method:
String.Join(" ", YourString.ToCharArray())
It's very simple. To match any character, use . (dot) and then replace it with that character followed by one extra space.
Here parentheses (...) are used for grouping; the group can be accessed by $index.
Find what: "(.)"
Replace with: "$1 "
