C# Regex.Replace [^\w ] that also removes underscores?

So I have spent far too long on this and have tried tons of things with no luck. I think I am just bad at regex. I am trying to clean a string of ALL non-alphanumeric characters while leaving spaces. I DO NOT WANT TO USE [^A-Za-z0-9 ]+ due to language concerns.
Here are a few things I have tried:
cleaned_string = Regex.Replace(input_string, @"[^\w ]+[_]+);
cleaned_string = Regex.Replace(input_string, ([^\w ]+)([_]+));
cleaned_string = Regex.Replace(input_string, [^ \w?<!_]+);
Edit: Solved thanks to a very helpful person below.
My final product ended up being this: [_]+|[^\w\s]+
Thanks for all the help!

This should work for you
// Expression: _|[^\w\d ]
cleaned_string = Regex.Replace(input_string, @"_|[^\w\d ]", "");

You may use
var res = Regex.Replace(s, @"[\W_-[\s]]+", string.Empty);
See the regex demo.
Look at the \W pattern: it matches any non-word char. To exclude whitespace from what \W matches, use character class subtraction: [\W-[\s]]. This matches any char that \W matches except what \s matches. To also match _, just add it to the character class. Add the + quantifier to remove whole consecutive chunks of matching chars in one go.
Details
[ - start of a character class
\W_ - any non-word or _ chars
-[\s] - except for chars matched with \s (whitespace) pattern
] - end of the character class
+ - one or more times.
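To see the subtraction pattern on non-English input, here is a minimal sketch. The class name, the helper name StripNonAlphanumerics, and the sample string are made up for illustration; the only thing taken from the answer is the pattern itself. It assumes the default (Unicode-aware) behavior of \w in .NET, so accented letters are kept.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Cleaner
{
    // [\W_-[\s]]+ : runs of non-word chars or underscores, minus anything \s matches.
    private static readonly Regex NonAlphanumeric = new Regex(@"[\W_-[\s]]+");

    // Hypothetical helper name, not part of the original answer.
    public static string StripNonAlphanumerics(string input)
    {
        return NonAlphanumeric.Replace(input, string.Empty);
    }

    public static void Main()
    {
        // Underscore, comma, hyphen and exclamation mark are removed;
        // letters (including accented ones) and the space are kept.
        Console.WriteLine(StripNonAlphanumerics("héllo_wörld, foo-bar!"));
    }
}
```

Note that spaces are left exactly as they were, so removing punctuation between two spaces leaves a double space behind.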

Related

Remove whitespace before or after a character with regex

I am new to regex and I want to find a good solution for removing whitespace before or after the / char in my string.
I have got string like
"Path01 /Some folder/ folder (2)"
I tried the regex
@"\s?()\s?"
but it does not work for me. The output I need is
Path01/Some folder/folder (2)
Can you help me?
Thanks!
You may use
@"\s*/\s*"
and replace with /.
See the regex demo
The pattern matches zero or more (*) whitespace chars (\s), then a / and then again 0+ whitespace chars.
C#:
var result = Regex.Replace(s, @"\s*/\s*", "/");
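A small self-contained version of the same replacement, run against the string from the question. The class and method names are placeholders chosen for this sketch.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PathTidy
{
    // Hypothetical helper: collapse any whitespace around each '/' into a bare '/'.
    public static string TrimAroundSlashes(string s)
    {
        return Regex.Replace(s, @"\s*/\s*", "/");
    }

    public static void Main()
    {
        // Sample string taken from the question.
        Console.WriteLine(TrimAroundSlashes("Path01 /Some folder/ folder (2)"));
        // prints: Path01/Some folder/folder (2)
    }
}
```

Whitespace that is not adjacent to a / (such as the space in "Some folder") is untouched.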

How to insert spaces between characters using Regex?

Trying to learn a little more about using Regex (Regular expressions). Using Microsoft's version of Regex in C# (VS 2010), how could I take a simple string like:
"Hello"
and change it to
"H e l l o"
This could be a string of any letter or symbol, capitals, lowercase, etc., and there are no other letters or symbols following or leading this word. (The string consists of only the one word).
(I have read the other posts, but I can't seem to grasp Regex. Please be kind :) ).
Thanks for any help with this. (an explanation would be most useful).
You could do this through regex only, no need for inbuilt c# functions.
Use the below regexes and then replace the matched boundaries with space.
(?<=.)(?!$)
DEMO
string result = Regex.Replace(yourString, @"(?<=.)(?!$)", " ");
Explanation:
(?<=.) Positive lookbehind asserts that the match must be preceded by a character.
(?!$) Negative lookahead which asserts that the match won't be followed by an end of the line anchor. So the boundaries next to all the characters would be matched but not the one which was next to the last character.
OR
You could also use word boundaries.
(?<!^)(\B|\b)(?!$)
DEMO
string result = Regex.Replace(yourString, @"(?<!^)(\B|\b)(?!$)", " ");
Explanation:
(?<!^) Negative lookbehind which asserts that the match won't be at the start.
(\B|\b) Matches the position between two word characters or between two non-word characters (\B), or the boundary between a word character and a non-word character (\b).
(?!$) Negative lookahead asserts that the match won't be followed by an end of the line anchor.
Regex.Replace("Hello", "(.)", "$1 ").TrimEnd();
Explanation
The dot character class matches every character of your string "Hello".
The parentheses around the dot are required so that we can refer to the captured character through the $n notation.
Each captured character is replaced by the replacement string. Our replacement string is "$1 " (notice the space at the end). Here $1 represents the first captured group in the input, therefore our replacement string will replace each character by that character plus one space.
This technique will add one space after the final character "o" as well, so we call TrimEnd() to remove that.
A demo can be seen here.
For the enthusiast, the same effect can be achieved through LINQ using this one-liner:
String.Join(" ", YourString.AsEnumerable())
or if you don't want to use the extension method:
String.Join(" ", YourString.ToCharArray())
It's very simple. To match any character, use the dot (.), then replace each match with that character followed by one extra space.
Here parentheses (...) are used for grouping, and the group can be referenced by $index.
Find what : "(.)"
Replace with "$1 "
DEMO
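For comparison, here is a rough sketch that puts the lookaround, capture-group, and String.Join approaches from the answers above side by side. The class and method names are invented for this example. It assumes a single-line input with no trailing whitespace, as in the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Spacer
{
    // Zero-width match at every position that has a character before it
    // and is not the end of the string.
    public static string WithLookarounds(string s)
    {
        return Regex.Replace(s, @"(?<=.)(?!$)", " ");
    }

    // Capture each character, append a space, then trim the trailing space.
    public static string WithCapture(string s)
    {
        return Regex.Replace(s, "(.)", "$1 ").TrimEnd();
    }

    // No regex at all: join the characters with a space.
    public static string WithJoin(string s)
    {
        return string.Join(" ", s.ToCharArray());
    }

    public static void Main()
    {
        Console.WriteLine(WithLookarounds("Hello")); // H e l l o
        Console.WriteLine(WithCapture("Hello"));     // H e l l o
        Console.WriteLine(WithJoin("Hello"));        // H e l l o
    }
}
```

One difference to keep in mind: TrimEnd() in the capture version removes all trailing whitespace, so it would also eat whitespace that was part of the original input.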

Replace whitespaces between specific characters in c#

I'm looking for a regexp (or any other solution) that would let me replace all whitespace characters between specific non whitespace chars. Eg:
instance. method
instance .method
"instance" .method
instance. "method"
Is it possible?
EDIT:
In other words - I want to throw out whitespace if it's between letter and dot, dot and letter, quotation mark and dot or dot and quotation mark.
Using lookaheads and lookbehinds:
var regex = new Regex("(?<=[a-zA-Z])\\s+(?=\\.)|(?<=\\.)\\s+(?=[a-zA-Z])|(?<=\")\\s+(?=\\.)|(?<=\\.)\\s+(?=\")");
Console.WriteLine(regex.Replace("instance. method", ""));
Console.WriteLine(regex.Replace("instance .method", ""));
Console.WriteLine(regex.Replace("\"instance\" .method", ""));
Console.WriteLine(regex.Replace("instance. \"method\"", ""));
Result:
instance.method
instance.method
"instance".method
instance."method"
The regex has four parts:
(?<=[a-zA-Z])\s+(?=\.) //Matches [a-zA-Z] before and . after:
(?<=\.)\s+(?=[a-zA-Z]) //Matches . before and [a-zA-Z] after
(?<=")\s+(?=\.) //Matches " before and . after
(?<=\.)\s+(?=") //Matches . before and " after
I want to throw out whitespace if it's between letter and dot, dot and letter, quotation mark and dot or dot and quotation mark.
I would use something like this:
@"(?i)(?:(?<=\.) (?=[""a-z])|(?<=[""a-z]) (?=\.))"
regex101 demo
Or broken down:
(?i) // makes the regex case insensitive.
(?:
(?<=\.) // ensure there's a dot before the match
[ ] // space (enclose it in [] if you use the expanded mode; otherwise you don't need the [])
(?=[a-z""]) // ensure there's a letter or quote after the match
| // OR
(?<=[a-z""]) // ensure there's a letter or quote before the match
[ ] // space
(?=\.) // ensure there's a dot after the match
)
In a variable:
var reg = new Regex(@"(?i)(?:(?<=\.) (?=[""a-z])|(?<=[""a-z]) (?=\.))");
What you are looking for (and what to search for on Google) is "lookahead and lookbehind". Basically, you use a regex to find all runs of whitespace, or split the string on whitespace (I prefer this one), then look at the characters immediately before and after each match and check whether they meet your criteria. If they do, remove the whitespace at that position.
Unfortunately I do not know of a "single statement" solution for what you are attempting to do.
Is this what you seek? (regex101 link)
[A-Za-z"](\s)\.|\.(\s)[A-Za-z"]
You can parse the string with word bounds:
^([\w\".]*)([\s])([\w\".]*)$
$1 will give you the first part.
$2 will give you the white space.
$3 will give you the end part.
Regex.Replace(instance, "([\\w\\d\".])\\s([\\w\\d\".])", "$1$2");
One alternate and simple solution would be to split the string on dot and then trim them.
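A minimal sketch of that split-and-trim idea, with hypothetical class and method names. Be aware that it is looser than the lookaround regexes above: it trims whitespace around every dot no matter what character sits on the other side, and it also trims the start and end of the whole string.

```csharp
using System;
using System.Linq;

public static class DotTrim
{
    // Hypothetical helper: split on '.', trim each piece, and rejoin with '.'.
    public static string TrimAroundDots(string s)
    {
        return string.Join(".", s.Split('.').Select(part => part.Trim()));
    }

    public static void Main()
    {
        // Inputs taken from the question.
        Console.WriteLine(TrimAroundDots("instance. method"));     // instance.method
        Console.WriteLine(TrimAroundDots("instance .method"));     // instance.method
        Console.WriteLine(TrimAroundDots("\"instance\" .method")); // "instance".method
        Console.WriteLine(TrimAroundDots("instance. \"method\"")); // instance."method"
    }
}
```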

What will be the regular expression for parsing all words except last one?

I want to match the whole string except the last word. i.e. for
This is my house
the matched string should be This is my. What will be the regular expression for this?
This should do it:
^([\w ]*) [\w]+$
^ is start of line
([\w ]*) is your group of any number of letters and space
 [\w]+ is a space followed by one or more word characters
$ is end of line.
You really don't need regexp for this task, delete everything from the last whitespace to the end of the string and you'll have what you need.
Personally, I'd go with something less opaque for such a simple task:
var words = Regex.Split("this is my house", @"\s");
var allButLastWord = string.Join(" ",words.Take(words.Length-1));
If it ends with a whitespace (as per example), you can define it as:
^(.*)\s
This will remove the whitespace at the end which I believe is desirable effect.
You could just use Split and do something like this -
var text = "This is my house";
var arr = text.Split(' ');
var newtext = String.Join(" ",arr.Take(arr.Length-1));
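string GetAllWordsExceptLast(string original)
{
    original = original.Trim();
    int lastSpace = original.LastIndexOf(' ');
    // With a single word there is no space; return an empty string instead of throwing.
    return lastSpace < 0 ? string.Empty : original.Substring(0, lastSpace);
}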
That is, unless you're really determined to use regular expressions. They just seem like overkill for such a simple operation.
A robust solution trimming leading spaces:
(?!\s)(.+)(?=\s)
See it at work on regex101.com
Example: "5 hours", " 5 hours", "twenty five minutes", " twenty five minutes"
Explanation
Negative Lookahead (?!\s)
Assert that the Regex below does not match
\s matches any whitespace character (equal to [\r\n\t\f\v ])
1st Capturing Group (.+)
.+ matches any character (except for line terminators)
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
Positive Lookahead (?=\s)
Assert that the Regex below matches
\s matches any whitespace character (equal to [\r\n\t\f\v ])
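To try the lookahead pattern in C#, something like the following should do. The class and method names are placeholders for this sketch, and the inputs are the examples listed above. It assumes the input has no trailing whitespace; with trailing whitespace the greedy .+ would stop before that whitespace and keep the last word.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LastWord
{
    // (?!\s)  : the match may not start on whitespace (skips leading spaces)
    // (.+)    : greedy capture, backtracking until...
    // (?=\s)  : ...the next character is whitespace
    private static readonly Regex AllButLast = new Regex(@"(?!\s)(.+)(?=\s)");

    // Hypothetical helper: returns an empty string when there is only one word.
    public static string AllButLastWord(string s)
    {
        Match m = AllButLast.Match(s);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        Console.WriteLine(AllButLastWord("5 hours"));              // 5
        Console.WriteLine(AllButLastWord(" 5 hours"));             // 5
        Console.WriteLine(AllButLastWord("twenty five minutes"));  // twenty five
        Console.WriteLine(AllButLastWord(" twenty five minutes")); // twenty five
    }
}
```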

Regex: Match any punctuation character except . and _

Is there an easy way to match all punctuation except period and underscore, in a C# regex? Hoping to do it without enumerating every single punctuation mark.
Use Regex Subtraction
[\p{P}-[._]]
See the .NET Regex documentation. I'm not sure if other flavors support it.
C# example
string pattern = @"[\p{P}\p{S}-[._]]"; // added \p{S} to get ^,~ and ` (among others)
string test = @"_""'a:;%^&*~`bc!@#.,?";
MatchCollection mx = Regex.Matches(test, pattern);
foreach (Match m in mx)
{
    Console.WriteLine("{0}: {1} {2}", m.Value, m.Index, m.Length);
}
Explanation
The pattern is a Character Class Subtraction. It starts with a standard character class like [\p{P}] and then adds a Subtraction Character Class like -[._], which says to remove the . and _. The subtraction is placed inside the [ ] after the standard class guts.
The answers so far do not respect ALL punctuation. This should work:
(?![\._])\p{P}
(Explanation: Negative lookahead to ensure that neither . nor _ are matched, then match any unicode punctuation character.)
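As a quick check that the subtraction and lookahead patterns behave the same way, here is a small sketch with made-up class and method names that strips the matched characters from a sample string. Both versions use plain \p{P}, so characters that Unicode classifies as symbols rather than punctuation (for example ^ and ~) are left alone.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Punct
{
    // Character class subtraction: any punctuation except '.' and '_'.
    public static string StripWithSubtraction(string s)
    {
        return Regex.Replace(s, @"[\p{P}-[._]]", "");
    }

    // Negative lookahead: reject '.' and '_' first, then match any punctuation.
    public static string StripWithLookahead(string s)
    {
        return Regex.Replace(s, @"(?![._])\p{P}", "");
    }

    public static void Main()
    {
        Console.WriteLine(StripWithSubtraction("a.b_c!d,e;f")); // a.b_cdef
        Console.WriteLine(StripWithLookahead("a.b_c!d,e;f"));   // a.b_cdef
    }
}
```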
Here is something a little simpler. Not words or white-space (where words include A-Za-z0-9 AND underscore).
[^\w\s.]
You could possibly use a negated character class like this:
[^0-9A-Za-z._\s]
This includes every character except those listed. You may need to exclude more characters (such as control characters), depending on your ultimate requirements.
