Detect Special Characters in a text in C#

In my program I process strings that can be in any language (e.g. Japanese, Portuguese, Mandarin, English).
Sometimes these strings contain HTML special characters such as the trademark symbol (™), the registered symbol (®), or the copyright symbol (©).
I then generate an Excel sheet from this data. When a string contains a special character, the Excel file is still created, but it cannot be opened because it appears to be corrupted.
So I encoded each string before writing it to Excel. The result was that every string not in English was encoded. The picture shows that the asset description, which is Japanese text, was also converted into encoded text.
゜祌づ りゅ氧廩, 駤びょ菣 鏥こ埣槎で is converted to ゜祌づ りゅ氧廩, 駤びょ菣 鏥こ埣槎で, but I only wanted the special characters to be encoded.
So what I need is to identify whether a string contains that kind of special character. Since I am dealing with multiple languages, is there any way to identify whether a string contains HTML special characters?

Try this using the Regex.IsMatch Method:
string str = "*!#©™®";
var regx = new Regex("[^a-zA-Z0-9_.]");
if (regx.IsMatch(str))
{
    Console.WriteLine("Special character(s) detected.");
}

Try the Regex.Replace method:
// Replace letters and numbers with nothing then check if there are any characters left.
// The only characters will be something like $, #, ^, or $.
//
// [\p{L}\p{Nd}]+ checks for words/numbers in any language.
if (!string.IsNullOrWhiteSpace(Regex.Replace(input, @"([\p{L}\p{Nd}]+)", "")))
{
    // Do whatever with the string.
}

I suppose that you could start by treating your string as a Char array
https://msdn.microsoft.com/en-us/library/system.char(v=vs.110).aspx
Then you can examine each character in turn. Indeed on a second read of that manual page why not use this:
string s = "Sometime these strings may contain some HTML special characters like trademark symbol(™), registered symbol(®), Copyright symbol(©) and etc.゜祌づ りゅ氧廩, 駤びょ菣 鏥こ埣槎で";
Char[] ca = s.ToCharArray();
foreach (Char c in ca)
{
    if (Char.IsSymbol(c))
        Console.WriteLine("found symbol:{0} ", c);
}
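A narrower variant of the same idea, as a minimal sketch: Char.IsSymbol also reports math, currency and modifier symbols (for example + or $), so the version below checks only for the Unicode category OtherSymbol, which is where ™, ® and © live. The class and method names are hypothetical, and whether OtherSymbol is the right set for a given Excel export is an assumption to verify against real data.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical helper, not part of any library.
public static class SymbolDetector
{
    // True when the character belongs to the Unicode category OtherSymbol (So).
    private static bool IsOtherSymbol(char c)
    {
        return char.GetUnicodeCategory(c) == UnicodeCategory.OtherSymbol;
    }

    // Reports whether the text contains at least one OtherSymbol character.
    // Letters of any script (category L*) are never reported.
    public static bool ContainsOtherSymbol(string text)
    {
        return text.Any(IsOtherSymbol);
    }

    // Returns each distinct OtherSymbol character, in order of first appearance.
    public static List<char> FindOtherSymbols(string text)
    {
        return text.Where(IsOtherSymbol).Distinct().ToList();
    }
}
```

Because the check is based on the Unicode category rather than on an ASCII range, a string such as "日本語のテキスト" is left alone while "Brand™" is flagged, so only the flagged characters need to be encoded.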

Related

Interacting with files that have unicode characters in filename / escape sequence issues

I am trying to grab a handle to a file that has unicode characters in the filename.
For example, I have a file called c:\testø.txt. If I try new FileInfo("c:\testø.txt") I get an Illegal characters exception.
Trying again with an escape sequence: new FileInfo("c:\test\u00f8.txt") and it works! Yay!
So I've got a method to escape non-ASCII characters:
static string EscapeNonAsciiCharacters(string value)
{
    StringBuilder sb = new StringBuilder();
    foreach (char c in value)
    {
        if (c > 127)
        {
            // This character is too big for ASCII
            string encodedValue = "\\u" + ((int)c).ToString("x4");
            sb.Append(encodedValue);
        }
        else
        {
            sb.Append(c);
        }
    }
    return sb.ToString();
}
But when I take the output from this method the escape characters seem to be incorrect.
EscapeNonAsciiCharacters("c:\testø.txt") ## => "c:\test\\u00f8.txt"
When I pass that output to the FileInfo constructor, I get the illegal chars exception again. However, the \ in c:\ seems to be unaltered. When I look at how this character is represented within the StringBuilder in the static method, I see: {c: est\u00f8.txt} which leads me to believe that the first backslash is being escaped differently.
How can I properly append the characters escaped by the loop in EscapeNonAsciiCharacters so I don't get the double escape character in my output?
You have more escaped in those strings than you probably intend.
Note that \ needs to be escaped when in a string, because it is itself the escape character and \t means tab.
Windows, using NTFS, is fully unicode-capable, so the original error is most likely due to you not escaping the \ character.
I wrote a toy application to deal with the file named ʚ.txt, and the constructor has no problem with that or any other unicode characters.
So, instead of writing new FileInfo("c:\testø.txt"), you need to write new FileInfo("c:\\testø.txt") or new FileInfo(@"c:\testø.txt").
Your escape function is entirely unnecessary in the context of C# in general and NTFS (or, really, most modern file systems). External libraries may, themselves, have incompatibilities with unicode, but that will need to be dealt with separately.
You seem to be misunderstanding escaped characters.
In this C# code, it is the compiler that converts the \u00f8 to the correct unicode character:
new FileInfo("c:\test\u00f8.txt") // (the "\t" is actually causing an error here)
What you are doing here is just setting encodedValue to the string "\u00f8", and there is nothing ever converting the escaped string to the converted string:
string encodedValue = "\\u" + ((int)c).ToString("x4");
If you want to convert the escaped string, then you need to do something like this:
How to convert a string containing escape characters to a string
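As a minimal sketch of that conversion, assuming the text really does contain escape sequences such as \u00f8 at run time: Regex.Unescape turns them back into the characters they stand for. The class name and the constant are hypothetical. For the file path itself no conversion is needed, since a verbatim literal already holds the real characters.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class PathEscapes
{
    // A verbatim literal keeps the backslash as-is, so this is the real path.
    public const string VerbatimPath = @"c:\testø.txt";

    // Converts run-time escape text (for example the six characters \u00f8)
    // into the character it denotes. Regex.Unescape also interprets other
    // escapes such as \t, so do not feed it an unescaped Windows path.
    public static string Unescape(string escaped)
    {
        return Regex.Unescape(escaped);
    }
}
```

With this, PathEscapes.Unescape(@"test\u00f8.txt") yields the same text as the literal "testø.txt".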

How would one trim all non-alphanumeric and numeric characters from the beginning and end of a string?

EDIT: I changed the title to reflect specifically what it is I'm trying to do.
Is there a way to retrieve all alphanumeric (or preferably, just the alphabet) characters for the current culture in .NET? My scenario is that I have several strings that I need to remove all numerals and non-alphabet characters from, and I'm not quite sure how I would implement this while honoring the alphabet of languages other than English (short of creating arrays of all alphabet characters for all supported languages of .NET, or at least the languages of our current clients lol)
UPDATE:
Specifically, what I'm trying to do is trim all non-alphabet chars from the start of the string up until the first alphabet character, and then from the last alphabet character to the end of the string. So for a random example in en-US, I want to turn:
()&*1#^#47*^#21%Littering aaaannnnd(*&^1#*32%#**)7(#9&^
into the following:
Littering aaaannnnd
This would be simple enough to do for English since it's my first language, but really in any culture I need to be able to remove numerals and other non-alphanumeric characters from the string.
string something = "()&*1#^#47*^#21%Littering aaaannnndóú(*&^1#*32%#**)7(#9&^";
string somethingNew = Regex.Replace(something, @"[^\p{L}-\s]+", "");
Is this what you're looking for?
Edit: Added to allow other languages characters. This will output Littering aaaannnndóú
Using regex method, this should work out:
string input = "()&*1#^#47*^#21%Littering aaaannnnd(*&^1#*32%#**)7(#9&^";
string result = Regex.Replace(input, "(?:^[^a-zA-Z]*|[^a-zA-Z]*$)", ""); //TRIM FROM START & END
Without using regex:
In Java, you could do:
while (true) {
    if (word.length() == 0) {
        return ""; // bad
    }
    if (!Character.isLetter(word.charAt(0))) {
        word = word.substring(1);
        continue; // so we are doing front first
    }
    if (!Character.isLetter(word.charAt(word.length()-1))) {
        word = word.substring(0, word.length()-1);
        continue; // then we are doing end
    }
    break; // if front is done, and end is done
}
If you are using a language other than Java, substituting Character.isLetter is straightforward: look up the character encoding to find the integer values of the alphabetic characters and test against those.
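The same front-then-back trimming can be sketched in C# with char.IsLetter, which recognises letters of any script, so no table of integer values is needed. The class and method names are hypothetical, and the sketch works on UTF-16 code units, so letters outside the Basic Multilingual Plane are not treated as letters.

```csharp
// Hypothetical helper, shown as a sketch rather than a definitive implementation.
public static class LetterTrimmer
{
    // Removes every non-letter from the start and the end of the input and
    // leaves the interior untouched. Returns "" when no letter is found.
    public static string TrimNonLetters(string input)
    {
        if (string.IsNullOrEmpty(input)) return string.Empty;

        int start = 0;
        int end = input.Length - 1;

        // Advance past leading non-letters.
        while (start <= end && !char.IsLetter(input[start])) start++;
        // Step back over trailing non-letters.
        while (end >= start && !char.IsLetter(input[end])) end--;

        return start > end ? string.Empty : input.Substring(start, end - start + 1);
    }
}
```

Unlike the repeated substring calls in the loop above, this version only moves two indexes and allocates one result string.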

Removing hidden characters from within strings

My problem:
I have a .NET application that sends out newsletters via email. When the newsletters are viewed in outlook, outlook displays a question mark in place of a hidden character it can’t recognize. These hidden character(s) are coming from end users who copy and paste html that makes up the newsletters into a form and submits it. A c# trim() removes these hidden chars if they occur at the end or beginning of the string. When the newsletter is viewed in gmail, gmail does a good job ignoring them. When pasting these hidden characters in a word document and I turn on the “show paragraph marks and hidden symbols” option the symbols appear as one rectangle inside a bigger rectangle. Also the text that makes up the newsletters can be in any language, so accepting Unicode chars is a must. I've tried looping through the string to detect the character but the loop doesn't recognize it and passes over it. Also asking the end user to paste the html into notepad first before submitting it is out of the question.
My question:
How can I detect and eliminate these hidden characters using C#?
You can remove all control characters from your input string with something like this:
string input; // this is your input string
string output = new string(input.Where(c => !char.IsControl(c)).ToArray());
Here is the documentation for the IsControl() method.
Or if you want to keep letters and digits only, you can also use the IsLetter and IsDigit function:
string output = new string(input.Where(c => char.IsLetter(c) || char.IsDigit(c)).ToArray());
I usually use this regular expression to replace all non-printable characters.
By the way, most of the people think that tab, line feed and carriage return are non-printable characters, but for me they are not.
So here is the expression:
string output = Regex.Replace(input, @"[^\u0009\u000A\u000D\u0020-\u007E]", "*");
The leading ^ inside the brackets negates the class, so the expression matches every character that is not one of the following:
\u0009 is tab
\u000A is linefeed
\u000D is carriage return
\u0020-\u007E means everything from space to ~, that is, every printable ASCII character.
See an ASCII table if you want to make changes. Remember that it will also replace every non-ASCII character.
To test above you can create a string by yourself like this:
string input = string.Empty;
for (int i = 0; i < 255; i++)
{
    input += (char)(i);
}
What best worked for me is:
string result = new string(value.Where(c => char.IsLetterOrDigit(c) || (c >= ' ' && c <= byte.MaxValue)).ToArray());
Here I keep any character that is a letter or digit, so that non-English letters are not dropped. If a character is not a letter or digit, I keep it only when it lies between space and byte.MaxValue (255), which drops the ASCII control characters but preserves punctuation.
Some suggest using IsControl to check whether the character is non printable or not, but that ignores Left-To-Right mark for example.
new string(input.Where(c => !char.IsControl(c)).ToArray());
IsControl misses invisible format characters such as the left-to-right mark (LRM), which commonly hides in a string after a copy and paste. If you are sure that your string should contain only letters and digits, then you can use IsLetterOrDigit:
new string(input.Where(c => char.IsLetterOrDigit(c)).ToArray())
If your string also has special characters such as punctuation, and only ASCII needs to survive, then
new string(input.Where(c => c < 128).ToArray())
You can do this:
var hChars = new char[] {...};
var result = new string(yourString.Where(c => !hChars.Contains(c)).ToArray());
TLDR Answer
Use this Regex...
\P{Cc}\P{Cn}\P{Cs}
Like this...
var regex = new Regex(@"![\P{Cc}\P{Cn}\P{Cs}]");
TLDR Explanation
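\P{Cc} : Do not match control characters.
\P{Cn} : Do not match unassigned characters.
\P{Cs} : Do not match surrogate code units.
Be aware that inside one [...] class these negated categories are combined as a union, so the class as written matches nearly every character. To exclude all three categories at once, negate a positive class instead: [^\p{Cc}\p{Cn}\p{Cs}].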
Working Demo
In this demo, I use this regex to search the string "Hello, World!" followed by (char)4, the END OF TRANSMISSION control character.
using System;
using System.Text.RegularExpressions;
public class Test {
    public static void Main() {
        var regex = new Regex(@"![\P{Cc}\P{Cn}\P{Cs}]");
        var matches = regex.Matches("Hello, World!" + (char)4);
        Console.WriteLine("Results: " + matches.Count);
        foreach (Match match in matches) {
            Console.WriteLine("Result: " + match);
        }
    }
}
Full Working Demo at IDEOne.com
The output from the above code:
Results: 1
Result: !
Alternatives
\P{C} : Match only visible characters. Do not match any invisible characters.
\P{Cc} : Match only non-control characters. Do not match any control characters.
\P{Cc}\P{Cn} : Match only non-control characters that have been assigned. Do not match any control or unassigned characters.
\P{Cc}\P{Cn}\P{Cs} : Match only non-control characters that have been assigned and are not surrogates. Do not match any control, unassigned, or surrogate characters.
\P{Cc}\P{Cn}\P{Cs}\P{Cf} : Match only non-control, non-formatting characters that have been assigned and are not surrogates. Do not match any control, unassigned, formatting, or surrogate characters.
Source and Explanation
Take a look at the Unicode Character Properties available that can be used to test within a regex. You should be able to use these regexes in Microsoft .NET, JavaScript, Python, Java, PHP, Ruby, Perl, Golang, and even Adobe. Knowing Unicode character classes is very transferable knowledge, so I recommend using it!
string output = new string(input.Where(c => !char.IsControl(c)).ToArray());
This will surely solve the problem. I had a non-printable substitute character (ASCII 26) in a string that was causing my app to break, and this line of code removed it.
I used this quick and dirty oneliner to clean some input from LTR/RTL marks left over by the broken Windows 10 calculator app. It's probably a far cry from perfect but good enough for a quick fix:
string cleaned = new string(input.Where(c => !char.IsControl(c) && (char.IsLetterOrDigit(c) || char.IsPunctuation(c) || char.IsSeparator(c) || char.IsSymbol(c) || char.IsWhiteSpace(c))).ToArray());
I experienced an error with the AWS S3 SDK
"Target resource path[name -‎3.‎30.‎2022 -‎15‎.‎27.‎00.pdf] has bidirectional characters, which are not supportedby System.Uri and thus cannot be handled by the .NET SDK"
The filename in my instance contained Unicode Character 'LEFT-TO-RIGHT MARK' (U+200E) between the dots. These were not visible in html or in Notepad++. When the text was pasted into Visual Studio 2019 Editor, the unicode text was visible and I was able to solve the issue.
The problem was solved by removing all control and other non-printable characters from the filename using the following line.
var input = Regex.Replace(s, @"\p{C}+", string.Empty);
Credit Source: https://stackoverflow.com/a/40568888/1165173
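A slightly narrower variant, as a sketch under stated assumptions: \p{C} also covers surrogate code units, and since .NET regexes work on UTF-16 code units, it can strip characters outside the Basic Multilingual Plane such as emoji. The version below removes only control (Cc) and format (Cf) characters and uses character class subtraction to keep tab, line feed and carriage return. The class name and the decision to keep those three characters are assumptions, not part of the answers above.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class InvisibleCharCleaner
{
    // Cc = control characters, Cf = format characters (LRM, RLM, zero-width
    // joiners and similar). The "-[\t\n\r]" subtraction keeps tab, LF and CR.
    private static readonly Regex Invisible = new Regex(@"[\p{Cc}\p{Cf}-[\t\n\r]]");

    // Returns the input with every matched invisible character removed.
    public static string Clean(string input)
    {
        return Invisible.Replace(input, string.Empty);
    }
}
```

For the filename case above, InvisibleCharCleaner.Clean removes the U+200E marks between the dots and leaves letters of any language untouched.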

Using a Regex to clean string versus Base64 Encoded string

I have an extension method that uses Regex.Replace to clean up invalid characters in a user-entered string before it is added to an XML document.
The intent of the regex is to strip out some random hi-ASCII characters that are occasionally in the input when the user pastes text from Microsoft Word and replace them with a space:
public static string CleanInput(this string inputString) {
    if (string.IsNullOrEmpty(inputString))
        return string.Empty;
    // Replace invalid characters with a space.
    return Regex.Replace(inputString, @"[^\w\.@-]", " ");
}
Now as fate would have it, someone is now using this extension method on a string that contains base64-encoded data.
My belief is that the regex will leave MOST of the base64 data unmodified, but I think it might be changing some of it.
So, knowing that \w in the regex matches [A-Za-z0-9_] and that Base64 uses effectively the same range, should this regex be changing the string or not?
If it is changing the string, why, and how would you change it so that hi-ASCII garbage is still cleaned up in regular non-encoded text without mucking up the encoded string?
Base64 also uses +,/, and =.
You can add these to your character class:
[^\w\.@+/=-]
Note that - has to be last in order for it to be a literal hyphen-minus instead of specifying a range.
It may also be worth considering that \w isn't necessarily the same as [A-Za-z0-9_] according to Microsoft.
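The difference can be checked with a short sketch. It assumes the question's character class is the one from Microsoft's CleanInput sample, [^\w\.@-], and compares it with the extended class. The class and method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for comparing the two character classes.
public static class InputCleaner
{
    // Original class: anything other than \w . @ - is replaced with a space.
    public static string CleanInput(string input)
    {
        if (string.IsNullOrEmpty(input)) return string.Empty;
        return Regex.Replace(input, @"[^\w\.@-]", " ");
    }

    // Extended class: also keeps the Base64 symbols + / and =.
    public static string CleanInputKeepingBase64(string input)
    {
        if (string.IsNullOrEmpty(input)) return string.Empty;
        return Regex.Replace(input, @"[^\w\.@+/=-]", " ");
    }
}
```

The two bytes 0xFB 0xFF encode to the Base64 text "+/8=". The original class turns that into two spaces, "8" and a trailing space, while the extended class returns it unchanged. Note that the extended class also lets +, / and = through in ordinary text, which may or may not be acceptable for the XML use case.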

Problem with escape character

I have a string variable. And it contains the text:
\0#«Ия\0ьw7к\b\0E\0њI\0\0ЂЪ\n
When I try to add it to the TextBox control, nothing is displayed, because \0 marks the end of the string.
How do I add text as it is?
UPDATE:
The text is placed in the variable dynamically, so the @ prefix is not suitable.
Is the idea that you want to display the backslashes? If so, the backslashes will need to be in the original string.
If you're getting that text from a string literal, it's just a case of making it a verbatim string literal:
string text = @"\0#«Ия\0ьw7к\b\0E\0њI\0\0ЂЪ\n";
If you want to pass in a string which really contains the Unicode "nul" character (U+0000) then you won't be able to get Windows to display that. You should remove those characters first:
textBox.Text = value.Replace("\0", "");
"\\0#«Ия\\0ьw7к\\b\\0E\\0њI\\0\\0ЂЪ\\n"
or
@"\0#«Ия\0ьw7к\b\0E\0њI\0\0ЂЪ\n"
Well, I don't know where your text is coming from, but if you have to, you can use
using System.Text.RegularExpressions;
...
string escapedText = Regex.Escape(originalText);
However, if the escaping does not happen early enough, the string will already contain null characters.
And it contains the text:
\0#«Ия\0ьw7к\b\0E\0њI\0\0ЂЪ\n
No it doesn't. That's what the debugger told you it contains. The debugger automatically formatted the content as though you had written it as a literal value in your source code. The string doesn't actually contain the backslashes, they were added by the debugger formatter.
The string actually contains binary zeros. You can see this for yourself by using string.ToCharArray(). You cannot display this string as-is, you have to get rid of the zeros. Displaying the content in hex could work for example, BitConverter.ToString(byte[]) helps with that.
You can't.
Standard Windows controls cannot display null characters.
If you're trying to display the literal text \0, change the string to start with an @ sign, which tells the compiler not to parse escape sequences. (@"\0#«Ия\0ьw7к\b\0E\0њI\0\0ЂЪ\n")
If you want to display as much of the string as you can, you can strip the nulls, like this:
textBox.Text = someString.Replace("\0", "");
You can also replace them with escape codes:
textBox.Text = someString.Replace("\0", @"\0");
You might try escaping the backslash in \0, i.e. \\0. See this MSDN reference for a full list of C# escape sequences.
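Replacing the nulls with escape codes can be generalised to every control character, as a minimal sketch. The class and method names are hypothetical, and the \uXXXX output format is an arbitrary choice for display rather than anything the TextBox requires.

```csharp
using System.Text;

// Hypothetical helper for illustration only.
public static class ControlCharDisplay
{
    // Replaces each control character (including \0, \b and \n) with a
    // visible \uXXXX escape so the whole string can be shown in a control.
    public static string MakeVisible(string value)
    {
        var sb = new StringBuilder(value.Length);
        foreach (char c in value)
        {
            if (char.IsControl(c))
                sb.Append("\\u").Append(((int)c).ToString("x4"));
            else
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```

Assigning textBox.Text = ControlCharDisplay.MakeVisible(someString) then shows every character of the original string, with the binary zeros rendered as \u0000.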
