How to replace whitespace characters (Unicode to UTF-8) with a regex in C#

I'm trying to do a regex replace in C#. The method I'm trying to write replaces some Unicode whitespace characters with a normal space in UTF-8.
Let me explain with code. I'm not good at writing regular expressions or at handling culture information.
// This method replaces Unicode whitespace characters with regular (UTF-8) spaces
public static string cleanUnicodeSpaces(string value)
{
    // This first pattern works, but it also removes other special characters,
    // for example accent marks
    // string pattern = @"[^\u0000-\u007F]+";
    string cleaned = "";
    string pattern = @"[^\u0020\u0009\u000D]+"; // Unicode characters
    string replacement = ""; // Replace with a UTF-8 space
    Regex regex = new Regex(pattern);
    cleaned = regex.Replace(value, replacement).Trim(); // Trim to remove leading and trailing spaces
    return cleaned;
}
Unicode spaces
HT:U+0009 = Character tabulation
LF:U+000A = Line Feed
CR:U+000D = Carriage Return
What am I doing wrong?
Sources
Unicode characters: https://unicode-table.com/en
White spaces: https://en.wikipedia.org/wiki/Whitespace_character
Regex: https://msdn.microsoft.com/es-es/library/system.text.regularexpressions.regex(v=vs.110).aspx
SOLUTION
Thanks to @wiktor-stribiżew and @mathias-r-jessen, the solution is:
string pattern = @"[\u0020\u0009\u000D\u00A0]+";
// I included \u00A0 to also replace &nbsp; (the no-break space)

Your regex, [^\u0020\u0009\u000D]+, is a negated character class that matches one or more chars other than a regular space (\u0020), tab (\u0009) and carriage return (\u000D). What you are actually looking for is a positive character class that matches any of the three chars you listed in the question (\x0A for a line feed, \x0D for a carriage return and \x09 for a tab), with each match replaced by a regular space (\x20).
You may just use
var res = Regex.Replace(s, @"[\x0A\x0D\x09]", " ");
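Putting the answer and the asker's final pattern together, a minimal sketch might look like the following. The class name WhitespaceCleaner and the method name CleanUnicodeSpaces are illustrative only, and the set of characters treated as whitespace (tab, line feed, carriage return, space, no-break space) is an assumption based on the characters mentioned in this thread.

```csharp
using System.Text.RegularExpressions;

public static class WhitespaceCleaner
{
    // Positive character class: one or more of tab, LF, CR, space or no-break space.
    // The exact set is an assumption; extend it if other Unicode spaces must be handled.
    private static readonly Regex Spaces =
        new Regex(@"[\u0009\u000A\u000D\u0020\u00A0]+");

    // Replaces every run of the listed characters with a single regular space,
    // then trims leading and trailing whitespace.
    public static string CleanUnicodeSpaces(string value)
    {
        return Spaces.Replace(value, " ").Trim();
    }
}
```

Because the class is positive rather than negated, letters with accent marks are left untouched; only the listed whitespace characters are replaced.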

Related

Regex match with Arabic

I have a text in Arabic and I want to use a regex to extract numbers from it. Here is my attempt.
String:
"ما المجموع:
1+2"
Match match = Regex.Match(text, "المجموع: ([^\\r\\n]+)", RegexOptions.IgnoreCase);
It always fails to match, and the group value always comes back as null.
expected output:
match.Groups[1].Value //returns (1+2)
The regex you wrote matches a word, then a colon, then a space and then 1 or more chars other than backslash, r and n.
You want to match the whole line after the word, colon and any amount of whitespace chars:
var text = "ما المجموع:\n1+2";
var result = Regex.Match(text, @"المجموع:\s*(.+)")?.Groups[1].Value;
Console.WriteLine(result); // => 1+2
Other possible patterns:
@"المجموع:\r?\n(.+)" // To match a CRLF or LF line ending only
@"المجموع:\n(.+)" // To match an LF line ending only
Also, if you run the regex against a long multiline text with CRLF line endings, it makes sense to replace .+ with [^\r\n]+, since . in a .NET regex matches any char except LF (\n) and therefore also matches the CR symbol.
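The CRLF advice above can be sketched as a small helper. The names LineAfterLabel and Extract are hypothetical, and returning an empty string when nothing matches is an assumption made for this example.

```csharp
using System.Text.RegularExpressions;

public static class LineAfterLabel
{
    // Returns the first non-empty line that follows "<label>:" and any whitespace.
    // [^\r\n]+ stops before CR as well as LF, so a trailing '\r' from a CRLF
    // line ending is not captured (which .+ would do).
    public static string Extract(string text, string label)
    {
        string pattern = Regex.Escape(label) + @":\s*([^\r\n]+)";
        Match match = Regex.Match(text, pattern);
        // Assumption for this sketch: an empty string signals "no match".
        return match.Success ? match.Groups[1].Value : string.Empty;
    }
}
```

Checking match.Success before reading the group makes the "no match" case explicit instead of relying on the value of an unmatched group.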

Regex - Removing specific characters before the final occurrence of #

So, I'm trying to remove certain characters [.&#] before the final occurrence of a #, but after that final #, those characters should be allowed.
This is what I have so far.
string pattern = @"\.|\&|\#(?![^#]+$)|[^a-zA-Z#]";
string input = "username#middle&something.else#company.com";
// input, pattern, replacement
string result = Regex.Replace(input, pattern, string.Empty);
Console.WriteLine(result);
Output: usernamemiddlesomethingelse#companycom
This currently removes all occurrences of the specified characters, apart from the final #. I'm not sure how to get this to work. Can anyone help?
You may use
[.&#]+(?=.*#)
Or the equivalent [.&#]+(?![^#]*$).
Details
[.&#]+ - 1 or more ., & or # chars
(?=.*#) - followed with any 0+ chars (other than LF) as many as possible and then a #.
In C#:
string pattern = @"[.&#]+(?=.*#)";
string input = "username#middle&something.else#company.com";
string result = Regex.Replace(input, pattern, string.Empty);
Console.WriteLine(result);
// => usernamemiddlesomethingelse#company.com
Just a simple solution (and alternative to complex regex) using Substring and LastIndexOf:
string pattern = @"[.#&]";
string input = "username#middle&something.else#company.com";
string inputBeforeLastAt = input.Substring(0, input.LastIndexOf('#'));
// input, pattern, replacement
string result = Regex.Replace(inputBeforeLastAt, pattern, string.Empty) + input.Substring(input.LastIndexOf('#'));
Console.WriteLine(result);
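One caveat with the Substring and LastIndexOf approach is that LastIndexOf returns -1 when the input contains no #, and Substring(0, -1) then throws. A hedged sketch with a guard follows; the names BeforeLastMarker and Strip are illustrative, and leaving the input unchanged when there is no # is an assumption chosen to match what the lookahead pattern above does in that case.

```csharp
using System.Text.RegularExpressions;

public static class BeforeLastMarker
{
    // Removes '.', '&' and '#' from everything before the final '#',
    // and keeps the final '#' and the text after it untouched.
    public static string Strip(string input)
    {
        int last = input.LastIndexOf('#');
        if (last < 0)
        {
            // Assumption: with no '#' there is no "before the final #" part,
            // so the input is returned unchanged.
            return input;
        }

        string head = Regex.Replace(input.Substring(0, last), @"[.&#]", string.Empty);
        return head + input.Substring(last);
    }
}
```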

Replacing special chars in a string with a single unique char

I have a string like so:
string inputStr = "Name*&^%LastName*##";
The following Regex will replace all the special chars with a '-'
Regex rgx = new Regex("[^a-zA-Z0-9 - _]");
someStr = rgx.Replace(someStr, "-");
That produces an output something like:
Name---LastName---
How do I replace '---' with a single '-' so the output looks like this:
Name-LastName
So the question is how do I replace all the special chars with a single '-'?
Regards.
Try this:
Regex rgx = new Regex(@"[^a-zA-Z0-9 \- _]+"); // note that the - character is escaped
or
Regex rgx = new Regex("[^a-zA-Z0-9 _-]+"); // or put - as the last character
But this will give Name-LastName-. Is that okay?
If you don't want the - in the last position, you can use the following code as well. Credit goes to @MatthewStrawbridge (see the comments).
string someStr = rgx.Replace(inputStr, "-").TrimEnd('-');
This will output Name-LastName.
Edit: As @pguardiario pointed out in the comments, I updated my answer to escape the -, since - has a special meaning (it denotes a range) inside a character class ([]). To use - as a literal, we need to escape it or make it the first or last character of the character class.
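As a self-contained sketch of the approach in this answer (the class name Slug and method name Collapse are made up for illustration), the + quantifier collapses each run of special characters into one replacement, and TrimEnd removes the dash left at the end:

```csharp
using System.Text.RegularExpressions;

public static class Slug
{
    // '-' is placed last in the character class so it is read as a literal,
    // not as a range operator. The '+' makes each run of disallowed
    // characters a single match.
    private static readonly Regex Special = new Regex("[^a-zA-Z0-9 _-]+");

    // Replaces each run of special characters with one '-' and drops a trailing '-'.
    public static string Collapse(string input)
    {
        return Special.Replace(input, "-").TrimEnd('-');
    }
}
```

With the input from the question, Slug.Collapse("Name*&^%LastName*##") yields Name-LastName.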

Replace whole word with a symbol using C# Regex

So I am trying to replace a word like #theplace or #theplaces using a Regex pattern like:
String Pattern = string.Format(@"\b{0}\b", PlaceName);
But when I do the replacement, it does not find the pattern. I am guessing the # symbol is the problem.
Can someone show me what I need to do to the Regex pattern to get it to work?
The following code will replace any instances of #thepalace or #thepalaces with <replacement>.
var result = Regex.Replace(
"some text with #thepalace or #thepalaces in it."
+ "\r\nHowever, #thepalacefoo and bar#thepalace won't be replaced.", // input
@"\B#thepalaces?\b", // pattern
"<replacement>"); // replacement text
The ? makes the preceding character, s, optional. I'm using the static Regex.Replace method.
The \b matches boundaries between word and non-word characters. \B matches at every position where \b does not.
Result
some text with <replacement> or <replacement> in it.
However, #thepalacefoo and bar#thepalace won't be replaced.
Your problem* is the \b (word boundary) before the #. There is no word boundary between a space and a #, because both are non-word characters.
You could just remove it, or replace it with a non-boundary, \B (capital B).
string Pattern = string.Format(@"\B{0}\b", PlaceName);
* assuming that PlaceName begins with #.
Try this:
string PlaceName="theplace", Replacement ="...";
string Pattern = String.Format(@"#\b{0}\b", PlaceName);
string Result = Regex.Replace(input, Pattern, Replacement);
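The \B approach from the answers above can be wrapped in a helper that builds the pattern from a variable. This is only a sketch: TagReplacer and ReplaceTag are hypothetical names, the place name is assumed to start with a non-word character such as #, and Regex.Escape is added here as a precaution in case the name contains regex metacharacters.

```csharp
using System.Text.RegularExpressions;

public static class TagReplacer
{
    // Replaces placeName (optionally followed by "s") when it stands alone.
    // \B before the name: the position before '#' must NOT be a word boundary,
    // so "bar#thepalace" is skipped. \b after the name: the tag must end at a
    // word boundary, so "#thepalacefoo" is skipped.
    public static string ReplaceTag(string input, string placeName, string replacement)
    {
        string pattern = @"\B" + Regex.Escape(placeName) + @"s?\b";
        return Regex.Replace(input, pattern, replacement);
    }
}
```

Note that the replacement string is still interpreted by Regex.Replace, so a replacement containing $ would need extra care.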

Only keep the legal chars in a text using a .NET Regex

I have a list of legal characters and I want to remove all other chars from a text.
// my legal chars: letters, numbers, space, _, - and percentage
string legalChars = @"[\p{L}\p{Nd}_\- %]*";
string text = "[update], Text with {illegal} chars such as: !? {}";
I do find a lot of examples for removing illegal chars. I want to do the opposite.
How about:
String trimmed = Regex.Replace(input, @"[^\p{L}\p{Nd}_\- %]", "");
Or:
private static readonly Regex RemovalPattern
= new Regex(@"[^\p{L}\p{Nd}_\- %]");
...
string trimmed = RemovalPattern.Replace(input, "");
Note that your regex of legal characters currently doesn't include space, contrary to the comment.
Why not loop through the string yourself and check each character: if it's a legal char, append it to a new string (for example with a StringBuilder).
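The loop suggested in this last answer could be sketched as follows. The names LegalChars and KeepLegal are illustrative, and the mapping of \p{L} to char.IsLetter and \p{Nd} to char.IsDigit is my reading of the character classes in the question rather than something stated in the thread.

```csharp
using System.Text;

public static class LegalChars
{
    // Keeps letters, decimal digits, '_', '-', space and '%'; drops everything else.
    // char.IsLetter covers the Unicode letter categories (like \p{L}) and
    // char.IsDigit covers decimal digit numbers (like \p{Nd}).
    public static string KeepLegal(string text)
    {
        var builder = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            bool legal = char.IsLetter(c) || char.IsDigit(c)
                         || c == '_' || c == '-' || c == ' ' || c == '%';
            if (legal)
            {
                builder.Append(c);
            }
        }
        return builder.ToString();
    }
}
```

This avoids the regex entirely, at the cost of spelling out the legal set in code instead of in one pattern.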
