Removing hidden characters from within strings

My problem:
I have a .NET application that sends out newsletters via email. When the newsletters are viewed in Outlook, Outlook displays a question mark in place of a hidden character it can’t recognize. These hidden characters come from end users who copy and paste the HTML that makes up the newsletters into a form and submit it. A C# Trim() removes these hidden characters if they occur at the beginning or end of the string. When the newsletter is viewed in Gmail, Gmail does a good job of ignoring them. When I paste these hidden characters into a Word document and turn on the “show paragraph marks and hidden symbols” option, the symbols appear as one rectangle inside a bigger rectangle. The text that makes up the newsletters can be in any language, so accepting Unicode characters is a must. I've tried looping through the string to detect the character, but the loop doesn't recognize it and passes over it. Asking the end user to paste the HTML into Notepad before submitting it is out of the question.
My question:
How can I detect and eliminate these hidden characters using C#?

You can remove all control characters from your input string with something like this:
string input; // this is your input string
string output = new string(input.Where(c => !char.IsControl(c)).ToArray());
Here is the documentation for the IsControl() method.
Or, if you want to keep letters and digits only, you can use the IsLetter and IsDigit methods:
string output = new string(input.Where(c => char.IsLetter(c) || char.IsDigit(c)).ToArray());
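Before removing anything, it can help to see which characters the string really contains, since the question mentions that a plain loop seemed to pass over the hidden character. The following is a minimal sketch, not part of the original answer; the class name HiddenCharInspector and the method Describe are hypothetical names chosen for illustration. It lists every UTF-16 code unit with its code point and Unicode category, so an invisible character shows up as a line of its own.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper for illustration; not part of any library.
public static class HiddenCharInspector
{
    // Returns one line per UTF-16 code unit: index, code point (U+XXXX), Unicode category.
    public static List<string> Describe(string input)
    {
        var lines = new List<string>();
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            UnicodeCategory category = char.GetUnicodeCategory(c);
            lines.Add($"{i}: U+{(int)c:X4} {category}");
        }
        return lines;
    }
}
```

For a string such as "a\u200Eb", the middle entry reports U+200E with the category Format, which is the kind of character char.IsControl does not flag.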

I usually use this regular expression to replace all non-printable characters.
By the way, most people consider tab, line feed and carriage return non-printable characters, but I don't treat them as such.
So here is the expression:
string output = Regex.Replace(input, @"[^\u0009\u000A\u000D\u0020-\u007E]", "*");
^ at the start of the brackets negates the set, so the pattern matches any character that is not one of the following:
\u0009 is tab
\u000A is line feed
\u000D is carriage return
\u0020-\u007E means everything from space to ~, that is, every printable ASCII character.
See an ASCII table if you want to make changes. Remember that this also replaces every non-ASCII character.
To test the expression above, you can build a string yourself like this:
string input = string.Empty;
for (int i = 0; i < 255; i++)
{
input += (char)(i);
}

What worked best for me is:
string result = new string(value.Where(c => char.IsLetterOrDigit(c) || (c >= ' ' && c <= byte.MaxValue)).ToArray());
This keeps any letter or digit, so non-English letters are not dropped. If the character is not a letter or digit, I check that it lies between space and byte.MaxValue (U+00FF), which drops the ASCII control characters below space while keeping punctuation. Note that this range still lets U+007F through U+009F pass, which are also control characters.
Some suggest using IsControl to check whether a character is non-printable, but IsControl does not flag the left-to-right mark, for example.

new string(input.Where(c => !char.IsControl(c)).ToArray());
IsControl misses some invisible characters such as the left-to-right mark (LRM), a character that commonly sneaks into a string during copy and paste; LRM is a format character rather than a control character. If you are sure that your string contains only letters and digits, you can use IsLetterOrDigit:
new string(input.Where(c => char.IsLetterOrDigit(c)).ToArray())
If your string also contains special characters, you can keep only the ASCII characters:
new string(input.Where(c => c < 128).ToArray())
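Since the question requires text in any language, an ASCII-only filter may be too aggressive. One possible alternative, sketched here under the assumption that only the Control (Cc) and Format (Cf) Unicode categories need to go, is to filter by category with char.GetUnicodeCategory. The class and method names are hypothetical.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Hypothetical helper for illustration; not part of any library.
public static class HiddenCharFilter
{
    // Drops characters in the Control (Cc) and Format (Cf) categories and keeps
    // everything else, including non-English letters and punctuation.
    public static string RemoveControlAndFormat(string input)
    {
        return new string(input.Where(c =>
        {
            UnicodeCategory category = char.GetUnicodeCategory(c);
            return category != UnicodeCategory.Control
                && category != UnicodeCategory.Format;
        }).ToArray());
    }
}
```

Two caveats: tab, carriage return and line feed are control characters and are removed as well, and the Format category includes the zero-width joiner and non-joiner, which some scripts and emoji sequences rely on. Whether that is acceptable depends on the input.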

You can do this:
var hChars = new char[] {...};
var result = new string(yourString.Where(c => !hChars.Contains(c)).ToArray());

TLDR Answer
Use this Regex...
\P{Cc}\P{Cn}\P{Cs}
Like this...
var regex = new Regex(@"![\P{Cc}\P{Cn}\P{Cs}]");
TLDR Explanation
\P{Cc} : Do not match control characters.
\P{Cn} : Do not match unassigned characters.
\P{Cs} : Do not match UTF-8-invalid characters.
Working Demo
In this demo, I use this regex to search the string "Hello, World!" followed by an invisible character, (char)4, which is the END OF TRANSMISSION (EOT) control character.
using System;
using System.Text.RegularExpressions;
public class Test {
public static void Main() {
var regex = new Regex(@"![\P{Cc}\P{Cn}\P{Cs}]");
var matches = regex.Matches("Hello, World!" + (char)4);
Console.WriteLine("Results: " + matches.Count);
foreach (Match match in matches) {
Console.WriteLine("Result: " + match);
}
}
}
Full Working Demo at IDEOne.com
The output from the above code:
Results: 1
Result: !
Alternatives
\P{C} : Match only visible characters. Do not match any invisible characters.
\P{Cc} : Match only non-control characters. Do not match any control characters.
\P{Cc}\P{Cn} : Match only non-control characters that have been assigned. Do not match any control or unassigned characters.
\P{Cc}\P{Cn}\P{Cs} : Match only non-control characters that have been assigned and are UTF-8 valid. Do not match any control, unassigned, or UTF-8-invalid characters.
\P{Cc}\P{Cn}\P{Cs}\P{Cf} : Match only non-control, non-formatting characters that have been assigned and are UTF-8 valid. Do not match any control, unassigned, formatting, or UTF-8-invalid characters.
Source and Explanation
Take a look at the Unicode Character Properties available that can be used to test within a regex. You should be able to use these regexes in Microsoft .NET, JavaScript, Python, Java, PHP, Ruby, Perl, Golang, and even Adobe. Knowing Unicode character classes is very transferable knowledge, so I recommend using it!

string output = new string(input.Where(c => !char.IsControl(c)).ToArray());
This will surely solve the problem. I had a non-printable substitute character (ASCII 26) in a string, which was causing my app to break, and this line of code removed it.

I used this quick and dirty oneliner to clean some input from LTR/RTL marks left over by the broken Windows 10 calculator app. It's probably a far cry from perfect but good enough for a quick fix:
string cleaned = new string(input.Where(c => !char.IsControl(c) && (char.IsLetterOrDigit(c) || char.IsPunctuation(c) || char.IsSeparator(c) || char.IsSymbol(c) || char.IsWhiteSpace(c))).ToArray());

I experienced an error with the AWS S3 SDK
"Target resource path[name -‎3.‎30.‎2022 -‎15‎.‎27.‎00.pdf] has bidirectional characters, which are not supportedby System.Uri and thus cannot be handled by the .NET SDK"
The filename in my instance contained Unicode Character 'LEFT-TO-RIGHT MARK' (U+200E) between the dots. These were not visible in html or in Notepad++. When the text was pasted into Visual Studio 2019 Editor, the unicode text was visible and I was able to solve the issue.
The problem was solved by removing all control and other non-printable characters from the filename using the following code.
var input = Regex.Replace(s, @"\p{C}+", string.Empty);
Credit Source: https://stackoverflow.com/a/40568888/1165173

Related

C# regex to remove non - printable characters, and control characters, in a text that has a mix of many different languages, unicode letters

I would appreciate your help on this, since I do not know which range of characters to use, or whether there is a character class like the [[:cntrl:]] that I found in Ruby.
By non-printable, I mean all characters that are not shown in the output when one prints the input string; I want to delete them. Please note that I am looking for a C# regex; I do not have a problem with my code.

You may remove all control and other non-printable characters with
s = Regex.Replace(s, @"\p{C}+", string.Empty);
The \p{C} Unicode category class matches all control characters, even those outside the ASCII table because in .NET, Unicode category classes are Unicode-aware by default.
Breaking it down into subcategories
To only match basic control characters you may use \p{Cc}+, see 65 chars in the Other, Control Unicode category. It is equal to a [\u0000-\u0008\u000E-\u001F\u007F-\u0084\u0086-\u009F \u0009-\u000D \u0085]+ regex.
To only match 161 other format chars including the well-known soft hyphen (\u00AD), zero-width space (\u200B), zero-width non-joiner (\u200C), zero-width joiner (\u200D), left-to-right mark (\u200E) and right-to-left mark (\u200F) use \p{Cf}+. The equivalent including astral plane code points is a (?:[\xAD\u0600-\u0605\u061C\u06DD\u070F\u08E2\u180E\u200B-\u200F\u202A-\u202E\u2060-\u2064\u2066-\u206F\uFEFF\uFFF9-\uFFFB]|\uD804[\uDCBD\uDCCD]|\uD80D[\uDC30-\uDC38]|\uD82F[\uDCA0-\uDCA3]|\uD834[\uDD73-\uDD7A]|\uDB40[\uDC01\uDC20-\uDC7F])+ regex.
To match 137,468 Other, Private Use control code points you may use \p{Co}+, or its equivalent including astral plane code points, (?:[\uE000-\uF8FF]|[\uDB80-\uDBBE\uDBC0-\uDBFE][\uDC00-\uDFFF]|[\uDBBF\uDBFF][\uDC00-\uDFFD])+.
To match 2,048 Other, Surrogate code points that include some emojis, you may use \p{Cs}+, or [\uD800-\uDFFF]+ regex.
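The subcategories above can be combined selectively. As a hedged sketch, the pattern below removes control (Cc) and format (Cf) characters but uses .NET character class subtraction to keep tab, carriage return and line feed, and it leaves surrogates (Cs) alone so that emoji encoded as surrogate pairs survive. The class name InvisibleCharRegex and the method Strip are hypothetical, and the choice of which categories to remove is an assumption that may not suit every input.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of any library.
public static class InvisibleCharRegex
{
    // Control (Cc) and format (Cf) characters, minus tab, carriage return and line feed.
    // The "-[...]" part is .NET character class subtraction.
    private static readonly Regex Invisible =
        new Regex(@"[\p{Cc}\p{Cf}-[\t\r\n]]+", RegexOptions.Compiled);

    // Removes every match of the pattern and returns the cleaned string.
    public static string Strip(string s) => Invisible.Replace(s, string.Empty);
}
```

Removing \p{Cf} also removes the zero-width joiner and non-joiner, which some scripts and emoji sequences depend on, so it may be worth narrowing the class further for multilingual text.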

You can try with :
string s = "Täkörgåsmrgås";
s = Regex.Replace(s, @"[^\u0000-\u007F]+", string.Empty);
Updated answer after comments:
Documentation about non-printable character:
https://en.wikipedia.org/wiki/Control_character
Char.IsControl Method:
https://msdn.microsoft.com/en-us/library/system.char.iscontrol.aspx
Maybe you can try:
string input; // this is your input string
string output = new string(input.Where(c => !char.IsControl(c)).ToArray());

To remove all control and other non-printable characters
Regex.Replace(s, @"\p{C}+", String.Empty);
To remove the control characters only (if you don't want to remove the emojis 😎)
Regex.Replace(s, @"\p{Cc}+", String.Empty);

you can try this:
public static string TrimNonAscii(this string value)
{
string pattern = "[^ -~]*";
Regex reg_exp = new Regex(pattern);
return reg_exp.Replace(value, "");
}

Detect Special Characters in a text in C#

In my program, I'm going to process some strings. These strings can be in any language (e.g. Japanese, Portuguese, Mandarin, English, etc.).
Sometimes these strings may contain HTML special characters such as the trademark symbol (™), registered symbol (®), copyright symbol (©), etc.
Then I generate an Excel sheet with these details. But when there is a special character, the Excel file is created but cannot be opened because it appears to be corrupted.
So I encoded each string before writing it to Excel. But then every string that was not English was encoded as well. The picture shows that the asset description, which is Japanese text, was also converted into encoded text. I wanted to encode only the special characters.
゜祌づ りゅ氧廩, 駤びょ菣 鏥こ埣槎で is converted to ゜祌づ りゅ氧廩, 駤びょ菣 鏥こ埣槎で But I wanted to encode only the special characters.
So what I need is to identify whether a string contains that kind of special character. Since I am dealing with multiple languages, is there any way to identify whether a string contains HTML special characters?

Try this using the Regex.IsMatch Method:
string str = "*!#©™®";
var regx = new Regex("[^a-zA-Z0-9_.]");
if (regx.IsMatch(str))
{
Console.WriteLine("Special character(s) detected.");
}
See the Demo

Try the Regex.Replace method:
// Replace letters and numbers with nothing then check if there are any characters left.
// The only characters will be something like $, #, ^, or $.
//
// [\p{L}\p{Nd}]+ checks for words/numbers in any language.
if (!string.IsNullOrWhiteSpace(Regex.Replace(input, @"([\p{L}\p{Nd}]+)", "")))
{
// Do whatever with the string.
}
Detection demo.

I suppose that you could start by treating your string as a Char array
https://msdn.microsoft.com/en-us/library/system.char(v=vs.110).aspx
Then you can examine each character in turn. Indeed on a second read of that manual page why not use this:
string s = "Sometime these strings may contain some HTML special characters like trademark symbol(™), registered symbol(®), Copyright symbol(©) and etc.゜祌づ りゅ氧廩, 駤びょ菣 鏥こ埣槎で";
Char[] ca = s.ToCharArray();
foreach (Char c in ca){
if (Char.IsSymbol(c))
Console.WriteLine("found symbol:{0} ",c );
}

Regex to allow some special characters c#

I have to check whether a string contains special characters or not, but I can allow these 5 special characters in it: .()_-
I have written my regex as
var specialCharacterSet = "^[()_-.]";
var test = Regex.Match("a!", specialCharacterSet);
var isValid = test.Success;
but it's throwing an error:
error parsing "^[()_-.]" - [x-y] range in reverse order.

You have specified a range with -. Place it at the end:
[()_.-]
Otherwise the range is not correct: the lower boundary symbol _ appears later in the character table than the upper boundary symbol . (the full stop).
Also, if you plan to check whether any of the characters inside a string belongs to this set, you should remove ^, which checks only at the beginning of the string.
To test if a string meets some pattern, use Regex.IsMatch:
Indicates whether the regular expression finds a match in the input string.
var specialCharacterSet = "[()_.-]";
var test = Regex.IsMatch("a!", specialCharacterSet);
UPDATE
To accept any string value that does not contain any of the five characters, you can use
var str = "file.na*me";
if (!Regex.IsMatch(str, @"[()_.-]"))
Console.WriteLine(string.Format("{0}: Valid!", str));
else
Console.WriteLine(string.Format("{0}: Invalid!", str));
See IDEONE demo

You can use ^[()_\-.] or ^[()_.-]. If a character has a special meaning in regex, it is best to put \ before it.

[()_.-]
Keep - at the end or escape it to avoid forming an invalid range. A - inside a character class forms a range. Here
_ is decimal 95
. is decimal 46.
So it is forming an invalid range from 95 to 46
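The range problem described above can be checked directly. This is a small sketch with hypothetical names (CharClassRangeDemo, IsValidPattern, ContainsAllowedSpecial); it relies on the Regex constructor throwing an ArgumentException (or a subclass of it) when a pattern cannot be parsed.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of any library.
public static class CharClassRangeDemo
{
    // Returns true when the .NET regex engine can parse the pattern.
    public static bool IsValidPattern(string pattern)
    {
        try
        {
            _ = new Regex(pattern);
            return true;
        }
        catch (ArgumentException)
        {
            // Thrown for "[()_-.]" because _ (95) to . (46) is a reversed range.
            return false;
        }
    }

    // True if the input contains at least one of the five characters . ( ) _ -
    public static bool ContainsAllowedSpecial(string input) =>
        Regex.IsMatch(input, @"[()_.-]");
}
```

With this sketch, IsValidPattern("[()_-.]") is false, while both "[()_.-]" and the escaped form @"[()_\-.]" parse without error.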

var specialCharacterSet = "^[()_.-]";
var test = Regex.IsMatch("a!", specialCharacterSet);
Console.WriteLine(test);
Console.ReadLine();

Convert all special characters in a pattern to text using Regex.Escape(). Suppose you already have using System.Text.RegularExpressions;
string pattern = Regex.Escape("[");
then check like this
if (Regex.IsMatch("ab[c", pattern)) Console.WriteLine("found");
Microsoft doesn't tell about escape in the tutorial. I learned it from Perl.

The best way in terms of C# is [()_\-\.], because . and - are reserved characters for regex. You need to use an escape character before these reserved characters.

How would one trim all non-alphanumeric and numeric characters from the beginning and end of a string?

EDIT: I changed the title to reflect specifically what it is I'm trying to do.
Is there a way to retrieve all alphanumeric (or preferably, just the alphabet) characters for the current culture in .NET? My scenario is that I have several strings that I need to remove all numerals and non-alphabet characters from, and I'm not quite sure how I would implement this while honoring the alphabet of languages other than English (short of creating arrays of all alphabet characters for all supported languages of .NET, or at least the languages of our current clients lol)
UPDATE:
Specifically, what I'm trying to do is trim all non-alphabet chars from the start of the string up until the first alphabet character, and then from the last alphabet character to the end of the string. So for a random example in en-US, I want to turn:
()&*1#^#47*^#21%Littering aaaannnnd(*&^1#*32%#**)7(#9&^
into the following:
Littering aaaannnnd
This would be simple enough to do for English since it's my first language, but really in any culture I need to be able to remove numerals and other non-alphanumeric characters from the string.

string something = "()&*1#^#47*^#21%Littering aaaannnndóú(*&^1#*32%#**)7(#9&^";
string somethingNew = Regex.Replace(something, @"[^\p{L}-\s]+", "");
Is this what you're looking for?
Edit: Added \p{L} to allow characters from other languages. This will output Littering aaaannnndóú

Using regex method, this should work out:
string input = "()&*1#^#47*^#21%Littering aaaannnnd(*&^1#*32%#**)7(#9&^";
string result = Regex.Replace(input, "(?:^[^a-zA-Z]*|[^a-zA-Z]*$)", ""); //TRIM FROM START & END

Without using regex:
In Java, you could do:
while (true) {
if (word.length() == 0) {
return ""; // bad
}
if (!Character.isLetter(word.charAt(0))) {
word = word.substring(1);
continue; // so we are doing front first
}
if (!Character.isLetter(word.charAt(word.length()-1))) {
word = word.substring(0, word.length()-1);
continue; // then we are doing end
}
break; // if front is done, and end is done
}
If you are using something other than Java, substituting Character.isLetter is straightforward: look up the character encoding to find the integer values of the alphabetic characters and compare against those.
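Since the question is about .NET, the Java loop above could be written in C# roughly as follows. This is a sketch with hypothetical names (LetterTrimmer, TrimToLetters); it uses char.IsLetter, which recognizes letters from many alphabets, and it scans with two indexes instead of building a substring on every step.

```csharp
using System;

// Hypothetical helper for illustration; not part of any library.
public static class LetterTrimmer
{
    // Trims everything before the first letter and after the last letter.
    // Returns an empty string when the input contains no letters at all.
    public static string TrimToLetters(string word)
    {
        int start = 0;
        int end = word.Length - 1;

        // Move start forward to the first letter.
        while (start <= end && !char.IsLetter(word[start])) start++;
        // Move end backward to the last letter.
        while (end >= start && !char.IsLetter(word[end])) end--;

        return start > end ? string.Empty : word.Substring(start, end - start + 1);
    }
}
```

One limitation to keep in mind: char.IsLetter looks at single UTF-16 code units, so letters outside the Basic Multilingual Plane and trailing combining marks are not treated as letters here.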

Regular Expression: single word

I want to check in a C# program whether a user input is a single word. The word may only contain the characters A-Z and a-z. No spaces or other characters.
I tried [A-Za-z]*, but this doesn't work. What is wrong with this expression?
Regex regex = new Regex("[A-Za-z]*");
if (!regex.IsMatch(userinput))
{
...
}
Can you recommend a website with a comprehensive list of regex examples?

It probably works, but you aren't anchoring the regular expression. You need to use ^ and $ to anchor the expression to the beginning and end of the string, respectively:
Regex regex = new Regex("^[A-Za-z]+$");
I've also changed * to + because * will match 0 or more times while + will match 1 or more times.

You should add anchors for start and end of string: ^[A-Za-z]+$

Regarding the question of regex examples have a look at http://regexlib.com/.
For the regex, have a look at the special characters ^ and $, which represent starting and ending of string. This site can come in handy when constructing regexes in the future.

The asterisk character in regex specifies "zero or more of the preceding character class".
This explains why your expression is failing, because it will succeed if the string contains zero or more letters.
What you probably intended was to have one or more letters, in which case you should use the plus sign instead of the asterisk.
Having made that change, now it will fail if you enter a string that doesn't contain any letters, as you intended.
However, this still won't work for you entirely, because it will allow other characters in the string. If you want to restrict it to only letters, and nothing else, then you need to provide the start and end anchors (^ and $) in your regex to make the expression check that the 'one or more letters' is attached to the start and end of the string.
^[a-zA-Z]+$
This should work as intended.
Hope that helps.
For more information on regex, I recommend http://www.regular-expressions.info/reference.html as a good reference site.

I don't know what the C#'s regex syntax is, but try [A-Za-z]+.

Try ^[A-Za-z]+$. If you don't include the ^ and $, it will match any part of the string that contains alphabetic characters.

I know the question is only about strictly alphabetic input, but here's an interesting way of solving this which does not break on accented letters and other such special characters.
The regex "^\b.+?\b" will match the first word on the start of a string, but only if the string actually starts with a valid word character. Using that, you can simply check if A) the string matches, and B) the length of the matched string equals your full string's length:
public Boolean IsSingleWord(String userInput)
{
Regex firstWordRegex = new Regex("^\\b.+?\\b");
Match firstWordMatch = firstWordRegex.Match(userInput);
return firstWordMatch.Success && firstWordMatch.Length == userInput.Length;
}

Others have explained how to solve the problem you know about. Now I'll talk about a problem you may not know about: diacritics :-) Your solution doesn't support àèéìòù and many other letters. A correct solution would be:
^(\p{L}\p{M}*)+$
where \p{L} is any letter and \p{M}* is zero or more diacritic marks (in Unicode, diacritics can be separate from base letters, so you can have something like a + ` = à, or you can have precomposed characters like the standard à)
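A short sketch of how that pattern might be used from C#, with hypothetical names (SingleWordCheck, IsSingleWord). It accepts both the precomposed à (U+00E0) and the decomposed form, a followed by the combining grave accent (U+0300).

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of any library.
public static class SingleWordCheck
{
    // One or more letters, each optionally followed by combining marks.
    private static readonly Regex Word = new Regex(@"^(\p{L}\p{M}*)+$");

    // True when the whole input is a single word made of letters in any script.
    public static bool IsSingleWord(string input) => Word.IsMatch(input);
}
```

Note that in .NET $ also matches just before a final newline, so a pattern ending in \z instead of $ may be preferable if a trailing line feed must be rejected.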

If you just need the characters a-zA-Z, you could simply iterate over the characters and check whether each one is inside your range,
for example:
for each character c: ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z')
This could improve performance.
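The pseudocode above might look like this in C#. The names AsciiWordCheck and IsAsciiLetters are hypothetical, and rejecting empty input is an assumption made to match the "one or more letters" behavior of ^[A-Za-z]+$.

```csharp
// Hypothetical helper for illustration; not part of any library.
public static class AsciiWordCheck
{
    // True when the input is non-empty and every character is in a-z or A-Z.
    public static bool IsAsciiLetters(string input)
    {
        if (string.IsNullOrEmpty(input)) return false;

        foreach (char c in input)
        {
            bool isLetter = ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z');
            if (!isLetter) return false; // stop at the first character outside the range
        }
        return true;
    }
}
```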
