I'm looking for a regexp (or any other solution) that would let me remove all whitespace characters between specific non-whitespace characters. E.g.:
instance. method
instance .method
"instance" .method
instance. "method"
Is it possible?
EDIT:
In other words - I want to throw out whitespace if it's between letter and dot, dot and letter, quotation mark and dot or dot and quotation mark.
Using lookaheads and lookbehinds:
var regex = new Regex("(?<=[a-zA-Z])\\s+(?=\\.)|(?<=\\.)\\s+(?=[a-zA-Z])|(?<=\")\\s+(?=\\.)|(?<=\\.)\\s+(?=\")");
Console.WriteLine(regex.Replace("instance. method", ""));
Console.WriteLine(regex.Replace("instance .method", ""));
Console.WriteLine(regex.Replace("\"instance\" .method", ""));
Console.WriteLine(regex.Replace("instance. \"method\"", ""));
Result:
instance.method
instance.method
"instance".method
instance."method"
The regex has four parts:
(?<=[a-zA-Z])\s+(?=\.) // whitespace with [a-zA-Z] before and . after
(?<=\.)\s+(?=[a-zA-Z]) // whitespace with . before and [a-zA-Z] after
(?<=")\s+(?=\.) // whitespace with " before and . after
(?<=\.)\s+(?=") // whitespace with . before and " after
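Since the letter and quotation-mark cases have the same shape, the four alternatives can probably be condensed into two by putting both into one character class. This is only a sketch of that idea, not part of the original answer; the `condensed` variable name is illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

// Assumption: [a-zA-Z"] merges the "letter" and "quotation mark" alternatives.
// Inside a verbatim string (@"...") a double quote is written as "".
var condensed = new Regex(@"(?<=[a-zA-Z""])\s+(?=\.)|(?<=\.)\s+(?=[a-zA-Z""])");

Console.WriteLine(condensed.Replace("instance. method", ""));     // instance.method
Console.WriteLine(condensed.Replace("instance .method", ""));     // instance.method
Console.WriteLine(condensed.Replace("\"instance\" .method", "")); // "instance".method
Console.WriteLine(condensed.Replace("instance. \"method\"", "")); // instance."method"
```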
I want to throw out whitespace if it's between letter and dot, dot and letter, quotation mark and dot or dot and quotation mark.
I would use something like this:
@"(?i)(?:(?<=\.) (?=[""a-z])|(?<=[""a-z]) (?=\.))"
regex101 demo
Or broken down:
(?i) // makes the regex case insensitive.
(?:
(?<=\.) // ensure there's a dot before the match
[ ] // space (enclose it in [] if you use the expanded mode; otherwise you don't need the [])
(?=[a-z""]) // ensure there's a letter or quote after the match
| // OR
(?<=[a-z""]) // ensure there's a letter or quote before the match
[ ] // space
(?=\.) // ensure there's a dot after the match
)
In a variable:
var reg = new Regex(@"(?i)(?:(?<=\.) (?=[""a-z])|(?<=[""a-z]) (?=\.))");
What you are looking for (and what to search for on Google) is "regex lookahead and lookbehind". Basically, you use a regex to find all runs of whitespace, or split the string on whitespace (I prefer this one), and then look behind and ahead of each match to see whether the characters at those positions (previous and next) meet your criteria. Then remove the whitespace at that position if they do.
Unfortunately, I do not know of a "single statement" solution for what you are attempting to do.
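The manual approach described above could look roughly like this. It is a minimal sketch under assumptions: `RemoveSpacesAroundDots` and `IsLetterOrQuote` are hypothetical helper names, and the criteria (letter or quotation mark on one side, dot on the other) are taken from the question's edit.

```csharp
using System;
using System.Text;

// Hypothetical helper: drops each run of whitespace that sits between
// a letter/quote and a dot, or between a dot and a letter/quote.
static string RemoveSpacesAroundDots(string input)
{
    var sb = new StringBuilder();
    int i = 0;
    while (i < input.Length)
    {
        if (!char.IsWhiteSpace(input[i]))
        {
            sb.Append(input[i]);
            i++;
            continue;
        }

        // Find the end of the current whitespace run.
        int start = i;
        while (i < input.Length && char.IsWhiteSpace(input[i])) i++;

        // Look behind and ahead of the run ('\0' means there is no neighbour).
        char prev = start > 0 ? input[start - 1] : '\0';
        char next = i < input.Length ? input[i] : '\0';

        bool drop = (IsLetterOrQuote(prev) && next == '.')
                 || (prev == '.' && IsLetterOrQuote(next));

        // Keep the whitespace unless it matched the criteria.
        if (!drop) sb.Append(input, start, i - start);
    }
    return sb.ToString();
}

static bool IsLetterOrQuote(char c) => char.IsLetter(c) || c == '"';

Console.WriteLine(RemoveSpacesAroundDots("instance. method"));     // instance.method
Console.WriteLine(RemoveSpacesAroundDots("\"instance\" .method")); // "instance".method
Console.WriteLine(RemoveSpacesAroundDots("two words"));            // two words
```

Note that `char.IsLetter` accepts any Unicode letter, which is slightly broader than the `[a-zA-Z]` used in the regex answers.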
Is this what you seek? (regex101 link)
[A-Za-z"](\s)\.|\.(\s)[A-Za-z"]
You can parse the string with word bounds:
^([\w\".]*)([\s])([\w\".]*)$
$1 will give you the first part.
$2 will give you the white space.
$3 will give you the end part.
Regex.Replace(instance, "([\\w\\d\".])\\s([\\w\\d\".])", "$1$2");
One alternate and simple solution would be to split the string on dot and then trim them.
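A rough sketch of the split-and-trim idea, assuming a hypothetical helper named `TrimAroundDots`. It is blunter than the regex answers: it strips whitespace next to every dot regardless of the neighbouring character, and it also trims the start and end of the whole string.

```csharp
using System;
using System.Linq;

// Hypothetical helper: split on '.', trim each piece, and join the pieces back.
static string TrimAroundDots(string input) =>
    string.Join(".", input.Split('.').Select(part => part.Trim()));

Console.WriteLine(TrimAroundDots("instance. method"));     // instance.method
Console.WriteLine(TrimAroundDots("instance .method"));     // instance.method
Console.WriteLine(TrimAroundDots("instance. \"method\"")); // instance."method"
```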
I have a regex that detects URLs:
@"((http|ftp|https)\:\/\/)?([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?";
I am using it with Regex.Replace to remove URLs from text.
I do not want it to replace any word that starts with /images.
For example, if the text is "this is my text here is a link http://dfdf.com and my is /images/dd.gif",
I need http://dfdf.com to be replaced, but not /images/dd.gif.
My regex replaces the dd.gif part,
so I want to exclude any word that comes after images/.
Any idea how I can fix this?
You may start matching after a word boundary, and fail the match if it is immediately preceded with a whole "word" images/ using
\b(?<!\bimages/)(?:(?:http|ftp)s?://)?([\w-]+(?:\.[\w-]+)+)([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?
See the regex demo. Details:
\b - a word boundary
(?<!\bimages/) - no images/ as a whole word is allowed immediately on the left
(?:(?:http|ftp)s?://)? - an optional sequence of either http or ftp followed by an optional s and then the :// substring
([\w-]+(?:\.[\w-]+)+) - Group 1: one or more word or hyphen chars followed by one or more sequences of a . and then one or more word or hyphen chars
([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])? - an optional Group 2: zero or more word chars or chars from the .,@?^=%&:/~+#- set and then a word char or a char from the @?^=%&/~+#- set.
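In C#, the lookbehind pattern above could be applied along these lines. This is a sketch rather than the answerer's own code; the `urlPattern`, `text`, and `cleaned` names are illustrative, and the sample text is the one from the question.

```csharp
using System;
using System.Text.RegularExpressions;

// The pattern from the answer, written as a verbatim string.
string urlPattern =
    @"\b(?<!\bimages/)(?:(?:http|ftp)s?://)?([\w-]+(?:\.[\w-]+)+)([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?";

string text = "this is my text here is a link http://dfdf.com and my is /images/dd.gif";

// Remove every match; dd.gif survives because it is immediately preceded by "images/".
string cleaned = Regex.Replace(text, urlPattern, "");

Console.WriteLine(cleaned);
```

The URL is removed but the spaces on both sides of it remain, so the output has two consecutive spaces where the link used to be.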
As an alternative solution, you could match what you don't want to remove and capture what you do want to remove.
You can use a callback with Replace and test for the existence of group 1. If it is there, return an empty string. If it is not there, return the match to leave it unchanged.
\S*/images\S*|(?<!\S)((?:(?:https?|ftp)://)?[\w-]+(?:(?:\.[\w-]+)+)(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?)
Explanation
\S*/images\S* Match /images preceded and followed by optional non-whitespace chars that you want to keep
| Or
(?<!\S) Assert a whitespace boundary to the left
((?:(?:https?|ftp)://)?[\w-]+(?:(?:\.[\w-]+)+)(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?) The pattern that you tried with some minor changes to make it a bit shorter
Regex demo (Click on the Table tab to see the matches)
For example
var s = @"this is my text here is a link http://dfdf.com and my is /images/dd.gif";
var regex = new Regex(@"\S*/images\S*|(?<!\S)((?:(?:https?|ftp)://)?[\w-]+(?:(?:\.[\w-]+)+)(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?)");
var result = regex.Replace(s, match => match.Groups[1].Success ? "" : match.Value);
Console.WriteLine(result);
See a C# demo
So I have spent far too long on this and have tried tons of things with no luck. I think I am just bad at regex. I am trying to clean a string of ALL non-alphanumeric characters while leaving spaces. I DO NOT WANT TO USE [^A-Za-z0-9 ]+ due to language concerns.
Here are a few things I have tried:
cleaned_string = Regex.Replace(input_string, @"[^\w ]+[_]+);
cleaned_string = Regex.Replace(input_string, ([^\w ]+)([_]+));
cleaned_string = Regex.Replace(input_string, [^ \w?<!_]+);
Edit: Solved thanks to a very helpful person below.
My final product ended up being this: [_]+|[^\w\s]+
Thanks for all the help!
This should work for you
// Expression: _|[^\w\d ]
cleaned_string = Regex.Replace(input_string, @"_|[^\w\d ]", "");
You may use
var res = Regex.Replace(s, @"[\W_-[\s]]+", string.Empty);
See the regex demo.
Look at \W pattern: it matches any non-word chars. Now, you want to exclude a whitespace matching pattern from \W - use character class subtraction: [\W-[\s]]. This matches any char \W matches except what \s matches. And to also match a _, just add it to the character class. Add + quantifier to remove whole consecutive chunks of matching chars at one go.
Details
[ - start of a character class
\W_ - any non-word or _ chars
-[\s] - except for chars matched with \s (whitespace) pattern
] - end of the character class
+ - one or more times.
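To see the character class subtraction at work on non-ASCII text, here is a small sketch. The sample sentence and the `sample` and `stripped` variable names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Made-up sample containing accented letters, an underscore, and punctuation.
string sample = "Crème brûlée_#1, s'il vous plaît!";

// [\W_-[\s]]+ : runs of non-word chars or underscores, excluding whitespace.
string stripped = Regex.Replace(sample, @"[\W_-[\s]]+", string.Empty);

Console.WriteLine(stripped); // Crème brûlée1 sil vous plaît
```

The accented letters survive because \w in .NET is Unicode-aware by default, which is the reason for avoiding [^A-Za-z0-9 ]+ in the first place.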
This is in C#. I've been racking my brain, but no luck so far.
So for example
123456BVC --> 123456BVC (keep the same)
123456BV --> 123456 (remove trailing letters)
12345V --> 12345V (keep the same)
12345 --> 12345 (keep the same)
ABC123AB --> ABC123 (remove trailing letters)
It can start with anything.
I've tried @".*[a-zA-Z]{2}$" but no luck
This is in C# so that I always return a string removing the two trailing letters if they do exist and are not preceded with another letter.
Match result = Regex.Match(mystring, pattern);
return result.Value;
Your @".*[a-zA-Z]{2}$" regex matches any 0+ characters other than a newline (as many as possible) and 2 ASCII letters at the end of the string. You do not check the context, so the 2 letters are matched regardless of what comes before them.
You need a regex that will match the last two letters not preceded with a letter:
(?<!\p{L})\p{L}{2}$
See this regex demo.
Details:
(?<!\p{L}) - fails the match if a letter (\p{L}) is found before the current position (you may use [a-zA-Z] if you only want to deal with ASCII letters)
\p{L}{2} - 2 letters
$ - end of string.
In C#, use
var result = Regex.Replace(mystring, @"(?<!\p{L})\p{L}{2}$", string.Empty);
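Running the samples from the question through that replacement is a quick way to check it. In this sketch, `StripTrailingPair` is a hypothetical wrapper name.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the Regex.Replace call shown above.
static string StripTrailingPair(string s) =>
    Regex.Replace(s, @"(?<!\p{L})\p{L}{2}$", string.Empty);

foreach (var sample in new[] { "123456BVC", "123456BV", "12345V", "12345", "ABC123AB" })
{
    Console.WriteLine($"{sample} --> {StripTrailingPair(sample)}");
}
// 123456BVC --> 123456BVC
// 123456BV --> 123456
// 12345V --> 12345V
// 12345 --> 12345
// ABC123AB --> ABC123
```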
If you're looking to remove those last two letters, you can simply do this:
string result = Regex.Replace(originalString, @"[A-Za-z]{2}$", string.Empty);
Remember that in regex, $ matches at the end of the input or just before a final newline.
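A short sketch of what that remark about $ means in practice, along with one thing to watch for: this simpler pattern does not check what precedes the two letters. The variable names are illustrative, and \z (end of input only) is shown purely for contrast.

```csharp
using System;
using System.Text.RegularExpressions;

string withNewline = "123456BV\n";

// $ also matches just before a final newline, so the letters are removed
// and the newline stays.
string dollar = Regex.Replace(withNewline, @"[A-Za-z]{2}$", string.Empty);

// \z matches only at the very end of the input, so nothing is removed here.
string strictEnd = Regex.Replace(withNewline, @"[A-Za-z]{2}\z", string.Empty);

// Without a lookbehind, three trailing letters lose their last two.
string threeLetters = Regex.Replace("123456BVC", @"[A-Za-z]{2}$", string.Empty);

Console.WriteLine(dollar == "123456\n");     // True
Console.WriteLine(strictEnd == withNewline); // True
Console.WriteLine(threeLetters);             // 123456B
```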
Trying to learn a little more about using Regex (Regular expressions). Using Microsoft's version of Regex in C# (VS 2010), how could I take a simple string like:
"Hello"
and change it to
"H e l l o"
This could be a string of any letter or symbol, capitals, lowercase, etc., and there are no other letters or symbols following or leading this word. (The string consists of only the one word).
(I have read the other posts, but I can't seem to grasp Regex. Please be kind :) ).
Thanks for any help with this. (an explanation would be most useful).
You could do this through regex only, no need for inbuilt c# functions.
Use the below regexes and then replace the matched boundaries with space.
(?<=.)(?!$)
DEMO
string result = Regex.Replace(yourString, @"(?<=.)(?!$)", " ");
Explanation:
(?<=.) Positive lookbehind asserts that the match must be preceded by a character.
(?!$) Negative lookahead which asserts that the match won't be followed by an end of the line anchor. So the boundaries next to all the characters would be matched but not the one which was next to the last character.
OR
You could also use word boundaries.
(?<!^)(\B|\b)(?!$)
DEMO
string result = Regex.Replace(yourString, @"(?<!^)(\B|\b)(?!$)", " ");
Explanation:
(?<!^) Negative lookbehind which asserts that the match won't be at the start.
(\B|\b) Matches the boundary which exists between two word characters and two non-word characters (\B) or match the boundary which exists between a word character and a non-word character (\b).
(?!$) Negative lookahead asserts that the match won't be followed by an end of the line anchor.
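Both lookaround patterns can be tried side by side. A small sketch, where `SpaceOut` is a hypothetical helper name and the second input is made up to show that symbols are handled the same way as letters:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: replaces every zero-width match of the pattern with a space.
static string SpaceOut(string s, string pattern) => Regex.Replace(s, pattern, " ");

Console.WriteLine(SpaceOut("Hello", @"(?<=.)(?!$)"));        // H e l l o
Console.WriteLine(SpaceOut("Hello", @"(?<!^)(\B|\b)(?!$)")); // H e l l o
Console.WriteLine(SpaceOut("a-b!", @"(?<=.)(?!$)"));         // a - b !
Console.WriteLine(SpaceOut("a-b!", @"(?<!^)(\B|\b)(?!$)"));  // a - b !
```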
Regex.Replace("Hello", "(.)", "$1 ").TrimEnd();
Explanation
The dot character class matches every character of your string "Hello".
The parentheses around the dot are required so that we can refer to the captured character through the $n notation.
Each captured character is replaced by the replacement string. Our replacement string is "$1 " (notice the space at the end). Here $1 represents the first captured group in the input, therefore our replacement string will replace each character by that character plus one space.
This technique will add one space after the final character "o" as well, so we call TrimEnd() to remove that.
A demo can be seen here.
For the enthusiast, the same effect can be achieved through LINQ using this one-liner:
String.Join(" ", YourString.AsEnumerable())
or if you don't want to use the extension method:
String.Join(" ", YourString.ToCharArray())
It's very simple. To match any character, use the dot (.) and then replace it with that character plus one extra space.
Here, parentheses (...) are used for grouping, and the group can be accessed by $index.
Find what : "(.)"
Replace with "$1 "
DEMO
Guys I hate Regex and I suck at writing.
I have a string that is space-separated and contains several codes that I need to pull out. Each code begins with a capital letter and ends with a number. Each code is only two characters long.
I'm trying to create an array of strings from the initial string and I can't get the regular expression right.
Here is what I have
String[] test = Regex.Split(originalText, "([a-zA-Z0-9]{2})");
I also tried:
String[] test = Regex.Split(originalText, "([A-Z]{1}[0-9]{1})");
I don't have any experience with Regex as I try to avoid writing them whenever possible.
Anyone have any suggestions?
Example input:
AA2410 F7 A4 Y7 B7 A 0715 0836 E0.M80
I need to pull out F7, A4, B7. E0 should be ignored.
You want to collect the results, not split on them, right?
Regex regexObj = new Regex(@"\b[A-Z][0-9]\b");
MatchCollection allMatchResults = regexObj.Matches(subjectString);
should do this. The \bs are word boundaries, making sure that only entire strings (like A1) are extracted, not substrings (like the A1 in TWA101).
If you also need to exclude "words" with non-word characters in them (like E0.M80 in your comment), you need to define your own word boundary, for example:
Regex regexObj = new Regex(@"(?<=^|\s)[A-Z][0-9](?=\s|$)");
Now A1 only matches when surrounded by whitespace (or start/end-of-string positions).
Explanation:
(?<= # Assert that we can match the following before the current position:
^ # Start of string
| # or
\s # whitespace.
)
[A-Z] # Match an uppercase ASCII letter
[0-9] # Match an ASCII digit
(?= # Assert that we can match the following after the current position:
\s # Whitespace
| # or
$ # end of string.
)
If you also need to find non-ASCII letters/digits, you can use
\p{Lu}\p{N}
instead of [A-Z][0-9]. This finds all uppercase Unicode letters and Unicode digits (like Ä٣), but I guess that's not really what you're after, is it?
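To compare the two boundary definitions on the sample input from the question, something like the following could be used. It is only a sketch; `Codes` is a hypothetical helper name. Note that Y7 also fits the stated rule (capital letter followed by a digit), so it is returned even though the question's list of wanted codes omits it.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string input = "AA2410 F7 A4 Y7 B7 A 0715 0836 E0.M80";

// Hypothetical helper: collects the text of every match into an array.
static string[] Codes(string text, string pattern) =>
    Regex.Matches(text, pattern).Cast<Match>().Select(m => m.Value).ToArray();

string[] withWordBoundaries = Codes(input, @"\b[A-Z][0-9]\b");
string[] whitespaceDelimited = Codes(input, @"(?<=^|\s)[A-Z][0-9](?=\s|$)");

Console.WriteLine(string.Join(", ", withWordBoundaries));  // F7, A4, Y7, B7, E0
Console.WriteLine(string.Join(", ", whitespaceDelimited)); // F7, A4, Y7, B7
```

The \b version still picks up E0 because the dot after it counts as a word boundary; the whitespace-based lookarounds exclude it.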
Do you mean that each code looks like "A00"?
Then this is the regex:
"[A-Z][0-9][0-9]"
Very simple... By the way, there's no point writing {1} in a regex. [0-9]{1} means "match exactly one digit", which is exactly like writing [0-9].
Don't give up, simple regexes make perfect sense.
This should be ok:
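String[] all_codes = Regex.Matches(originalText, @"\b[A-Z]\d\b").Cast<Match>().Select(m => m.Value).ToArray();
It gives you an array with all codes that start with a capital letter followed by a digit and are delimited by any kind of word boundary (whitespace etc.). Regex.Matches is used here because Regex.Split would return the text between the codes rather than the codes themselves.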