Only keep the legal chars in a text using a .NET Regex - c#

I have a list of legal characters and I want to remove all other characters from a text.
// my legal chars. a-Z, numbers, space, _, - and percentage
string legalChars = @"[\p{L}\p{Nd}_\- %]*";
string text = "[update], Text with {illegal} chars such as: !? {}";
I do find a lot of examples for removing illegal chars. I want to do the opposite.

How about:
String trimmed = Regex.Replace(input, @"[^\p{L}\p{Nd}_\- %]", "");
Or:
private static readonly Regex RemovalPattern
= new Regex(@"[^\p{L}\p{Nd}_\- %]");
...
string trimmed = RemovalPattern.Replace(input, "");
Note that your regex of legal characters currently doesn't include space, contrary to the comment.

Why not loop through the string yourself and check each character? If it is a legal char, append it to a new string (for example with a StringBuilder).
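The loop suggested above could look roughly like this. It is only a sketch: the class and method names (LegalCharFilter, KeepLegalChars) are made up for illustration, and it assumes that char.IsLetter and char.IsDigit are acceptable stand-ins for \p{L} and \p{Nd}.

```csharp
using System.Text;

public static class LegalCharFilter
{
    // Hypothetical helper: keeps letters, decimal digits, '_', '-', ' ' and '%',
    // and drops every other character.
    public static string KeepLegalChars(string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            // char.IsLetter covers the Unicode letter categories (like \p{L}),
            // char.IsDigit covers decimal digits (like \p{Nd}).
            if (char.IsLetter(c) || char.IsDigit(c)
                || c == '_' || c == '-' || c == ' ' || c == '%')
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

Calling LegalCharFilter.KeepLegalChars(text) on the string from the question should leave only the letters and spaces, since brackets, braces and punctuation are not in the allowed set.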

Related

Regex expression to replace text between start and end characters, but keep characters too

How can I use a C# regex to replace text that starts with "<" and ends with ">", keeping the start and end characters and surrounding each match with {} brackets?
All occurrences in text should be replaced.
For example:
This is <my> long <text> should become
This is {<my>} long {<text>}.
Thomas is correct -- in this case, you do not need a regular expression. However, if you insist on using one (or want to expand this logic in the future to handle a range of characters), here it is:
var inputString = "This is <my> long <text>";
var newInputString = Regex.Replace(inputString, "(<[^>]+>)", "{$1}");
This regex assumes you are capturing at least one character between the angled brackets.
Why not just use Replace:
string text = "This is <my> long <text>";
var replacedText = text.Replace("<", "{<").Replace(">", ">}");
If you have encoded text, you can decode it first:
string text = "This is &lt;my&gt; long &lt;text&gt;";
var replacedText = WebUtility.HtmlDecode(text).Replace("<", "{<").Replace(">", ">}");

How replace whitespaces (unicode to utf-8) with a regex C#

I'm trying to write a regex replace in C#. The method should replace certain Unicode whitespace characters with a normal space.
Let me explain with code. I'm not very good with regular expressions or culture information.
//This method replaces Unicode white space with a normal space
public static string cleanUnicodeSpaces(string value)
{
//This first pattern works, but it also removes other special characters,
//for example accented letters
//string pattern = @"[^\u0000-\u007F]+";
string cleaned = "";
string pattern = @"[^\u0020\u0009\u000D]+"; //Unicode characters
string replacement = ""; //Replace by UTF-8 space
Regex regex = new Regex(pattern);
cleaned = regex.Replace(value, replacement).Trim(); //Trim to drop leading and trailing spaces
return cleaned;
}
Unicode spaces
HT:U+0009 = Character tabulation
LF:U+000A = Line Feed
CR:U+000D = Carriage Return
What am I doing wrong?
Source
Unicode Characters: https://unicode-table.com/en
White Spaces:https://en.wikipedia.org/wiki/Whitespace_character
Regex: https://msdn.microsoft.com/es-es/library/system.text.regularexpressions.regex(v=vs.110).aspx
SOLUTION
Thanks to @wiktor-stribiżew and @mathias-r-jessen, solution:
string pattern = @"[\u0020\u0009\u000D\u00A0]+";
//I include \u00A0 to replace non-breaking spaces (&nbsp;)
Your regex - [^\u0020\u0009\u000D]+ - is a negated character class that matches any 1+ chars other than a regular space (\u0020), tab (\u0009) and carriage return (\u000D). What you actually want is a positive character class that matches any of the three chars you listed in the question (\x0A for a newline, \x0D for a carriage return and \x09 for a tab), with each match replaced by a regular space (\x20).
You may just use
var res = Regex.Replace(s, @"[\x0A\x0D\x09]", " ");
See the regex demo
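Putting the answer and the asker's own fix together, the method could be sketched as below. The names SpaceNormalizer and CleanUnicodeSpaces are illustrative, and collapsing a whole run of matches into one space (the + quantifier) is an assumption about the desired result, not something the question states.

```csharp
using System.Text.RegularExpressions;

public static class SpaceNormalizer
{
    // Positive character class: tab, line feed, carriage return and
    // no-break space (U+00A0), one or more in a row.
    private static readonly Regex UnicodeSpaces =
        new Regex(@"[\u0009\u000A\u000D\u00A0]+");

    // Hypothetical helper: replaces each run with a single regular space,
    // then trims leading and trailing whitespace.
    public static string CleanUnicodeSpaces(string value)
    {
        return UnicodeSpaces.Replace(value, " ").Trim();
    }
}
```

Because the class is positive rather than negated, accented letters and other non-ASCII characters are left alone.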

Replacing doubleslash to single slash

In my C# application I want to convert escape sequences in a string to the special characters they represent.
My input string is "G\u00f6teborg" and I want the output to be Göteborg.
I am using below code,
string name = "G\\u00f6teborg";
StringBuilder sb = new StringBuilder(name);
sb = sb.Replace(@"\\", @"\");
string name1 = System.Web.HttpUtility.HtmlDecode(sb.ToString());
Console.WriteLine(name1);
In the code above the double backslash is not replaced by a single one, so after decoding I still get G\u00f6teborg as output.
Please help to find a solution for this.
Thanks in advance.
string name = "G\\u00f6teborg";
Just remove one of the backslashes:
string name = "G\u00f6teborg";
If you got the input from a user then you need to do more: it’s not enough to replace a backslash because that’s not how the characters are stored internally, the \uXXXX is an escape sequence representing a Unicode code point.
If you want to replace a user input escape sequence by a Unicode code point you need to parse the user input properly. You can use a regular expression for that:
MatchEvaluator replacer = m => ((char) int.Parse(m.Groups[1].Value, NumberStyles.AllowHexSpecifier)).ToString();
string result = Regex.Replace(name, @"\\u([a-fA-F0-9]{4})", replacer);
This matches each escape group (\u followed by four hex digits), extracts the hex digits, parses them and translates them to a character.
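For reference, here is the same idea wrapped into a self-contained helper with the using directives it needs. The class and method names (UnicodeUnescaper, Unescape) are invented for this sketch, and it only handles the four-digit \uXXXX form described above.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

public static class UnicodeUnescaper
{
    // A literal backslash, 'u', then exactly four hex digits (captured).
    private static readonly Regex EscapePattern =
        new Regex(@"\\u([a-fA-F0-9]{4})");

    // Hypothetical helper: replaces every \uXXXX sequence in the input
    // with the character that has that code.
    public static string Unescape(string input)
    {
        return EscapePattern.Replace(input, m =>
            ((char)int.Parse(m.Groups[1].Value, NumberStyles.AllowHexSpecifier)).ToString());
    }
}
```

With the input from the question, UnicodeUnescaper.Unescape("G\\u00f6teborg") should give Göteborg; text without escape sequences passes through unchanged.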

Removing numbers from text using C#

I have a text file to process that contains some numbers. I want just the text, and nothing else. I managed to remove the punctuation marks, but how do I remove the numbers? I want to do this in C#.
Also, I want to remove words longer than 10 characters. How do I do that using regular expressions?
You can do this with a regex:
string withNumbers = "abc123def"; // any string with numbers
string withoutNumbers = Regex.Replace(withNumbers, "[0-9]", "");
Use this regex to remove words with more than 10 characters:
\b\w{11,}\b
The quantifier {11,} sets a minimum length of 11 with no upper bound. Do not put a space inside the braces: in .NET, {10, 100} is not parsed as a quantifier and is matched as literal text.
Only letters and nothing else (because I see you also want to remove the punctuation marks)
Regex.IsMatch(input, @"^[a-zA-Z]+$");
You can also use string.Join:
string s = "asdasdad34534t3sdf43534";
s = string.Join(null, System.Text.RegularExpressions.Regex.Split(s, "[\\d]"));
The Regex.Replace method should do the trick.
// regex to match any digit
var regex = new Regex(@"\d");
// replace all matches in input with empty string
var output = regex.Replace(input, String.Empty);
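Both parts of the question (digits and long words) can be combined in one small helper. This is a sketch under a few assumptions: the names TextCleaner and RemoveDigitsAndLongWords are made up, a "word" means a run of \w characters, and long words are removed before digits so that their length is measured on the original text.

```csharp
using System.Text.RegularExpressions;

public static class TextCleaner
{
    // Hypothetical helper: removes words longer than 10 characters,
    // then removes all remaining digits.
    public static string RemoveDigitsAndLongWords(string input)
    {
        // Whole words made of 11 or more word characters.
        string noLongWords = Regex.Replace(input, @"\b\w{11,}\b", "");
        // Any remaining digit characters.
        return Regex.Replace(noLongWords, @"\d", "");
    }
}
```

The spaces that surrounded a removed word or number stay in place, so the result may contain runs of spaces that need collapsing afterwards.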

How to remove extra returns and spaces in a string by regex?

I convert HTML code to plain text, but the result has many extra returns and spaces. How do I remove them?
string new_string = Regex.Replace(orig_string, @"\s", "") will remove all whitespace
string new_string = Regex.Replace(orig_string, @"\s+", " ") will just collapse each run of whitespace into a single space
I'm assuming that you want to
find two or more consecutive spaces and replace them with a single space, and
find two or more consecutive newlines and replace them with a single newline.
If that's correct, then you could use
resultString = Regex.Replace(subjectString, @"( |\r?\n)\1+", "$1");
This keeps the original "type" of whitespace intact and also preserves Windows line endings correctly. If you also want to "condense" multiple tabs into one, use
resultString = Regex.Replace(subjectString, @"( |\t|\r?\n)\1+", "$1");
To condense a string of newlines and spaces (any number of each) into a single newline, use
resultString = Regex.Replace(subjectString, @"(?:(?:\r?\n)+ +){2,}", "\n");
I tried a lot of approaches for this. The loops all worked, but this one was the clearest.
//define what you want to remove as char
char tb = (char)9; //Tab char ASCII code
char spc = (char)32; //Space char ASCII code
char nwln = (char)10; //New line char ASCII code
//strings are immutable, so assign the result back
yourstring = yourstring.Replace(tb.ToString(), "");
yourstring = yourstring.Replace(spc.ToString(), "");
yourstring = yourstring.Replace(nwln.ToString(), "");
//by defining chars, result was better.
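You can use Trim() to remove spaces and returns at the start and end of the string. Leading and trailing whitespace is not significant in HTML, so you can drop it with the Trim() method of the System.String class. Note that Trim() does not touch whitespace inside the string.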
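The answers above can be combined into one pass that keeps line breaks but condenses everything else. This is a sketch only: WhitespaceCondenser and Condense are hypothetical names, and normalizing Windows line endings to "\n" is an assumption that may not suit every use.

```csharp
using System.Text.RegularExpressions;

public static class WhitespaceCondenser
{
    // Hypothetical helper for cleaning up text extracted from HTML.
    public static string Condense(string text)
    {
        // Runs of spaces and tabs become a single space.
        string result = Regex.Replace(text, @"[ \t]+", " ");
        // Any run of line breaks, with an optional space around each,
        // becomes a single "\n" (so "\r\n" is normalized to "\n").
        result = Regex.Replace(result, @" ?(?:\r?\n ?)+", "\n");
        // Drop whitespace at the very start and end.
        return result.Trim();
    }
}
```

The replacement for the line breaks is the ordinary string "\n" rather than a verbatim string, because .NET replacement patterns do not interpret character escapes.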
