Interacting with files that have unicode characters in filename / escape sequence issues - c#

I am trying to grab a handle to a file that has unicode characters in the filename.
For example, I have a file called c:\testø.txt. If I try new FileInfo("c:\testø.txt") I get an Illegal characters exception.
Trying again with an escape sequence: new FileInfo("c:\test\u00f8.txt") and it works! Yay!
So I've got a method to escape non-ASCII characters:
static string EscapeNonAsciiCharacters(string value)
{
    StringBuilder sb = new StringBuilder();
    foreach (char c in value)
    {
        if (c > 127)
        {
            // This character is too big for ASCII
            string encodedValue = "\\u" + ((int)c).ToString("x4");
            sb.Append(encodedValue);
        }
        else
        {
            sb.Append(c);
        }
    }
    return sb.ToString();
}
But when I take the output from this method the escape characters seem to be incorrect.
EscapeNonAsciiCharacters("c:\testø.txt") ## => "c:\test\\u00f8.txt"
When I pass that output to the FileInfo constructor, I get the illegal chars exception again. However, the \ in c:\ seems to be unaltered. When I look at how this character is represented within the StringBuilder in the static method, I see: {c: est\u00f8.txt} which leads me to believe that the first backslash is being escaped differently.
How can I properly append the characters escaped by the loop in EscapeNonAsciiCharacters so I don't get the double escape character in my output?

You have more escaped in those strings than you probably intend.
Note that \ needs to be escaped when in a string, because it is itself the escape character and \t means tab.
Windows, using NTFS, is fully unicode-capable, so the original error is most likely due to you not escaping the \ character.
I wrote a toy application to deal with the file named ʚ.txt, and the constructor has no problem with that or any other unicode characters.
So, instead of writing new FileInfo("c:\testø.txt"), you need to write new FileInfo("c:\\testø.txt") or new FileInfo(@"c:\testø.txt").
Your escape function is entirely unnecessary in the context of C# in general and NTFS (or, really, most modern file systems). External libraries may, themselves, have incompatibilities with unicode, but that will need to be dealt with separately.
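A minimal sketch of the point above, using only the base class library. The class name PathLiteralDemo and its method names are made up for illustration. It shows that the escaped, verbatim, and \u-escaped spellings all compile to the same string, while the literal from the question ends up containing a tab character.

```csharp
using System;

public static class PathLiteralDemo
{
    // Regular literal with the backslash escaped.
    public static string Escaped() => "c:\\testø.txt";

    // Verbatim literal: backslashes are taken as-is.
    public static string Verbatim() => @"c:\testø.txt";

    // The compiler, not the runtime, turns \u00f8 into the character ø.
    public static string UnicodeEscape() => "c:\\test\u00f8.txt";

    // The literal from the question: \t is compiled into a tab character.
    public static string Broken() => "c:\testø.txt";
}
```

A tab is not allowed in a Windows file name, which would be consistent with the exception described in the question.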

You seem to be misunderstanding escaped characters.
In this C# code, it is the compiler that converts the \u00f8 to the correct unicode character:
new FileInfo("c:\test\u00f8.txt") // (the "\t" is actually causing an error here)
What you are doing here is just setting encodedValue to the string "\u00f8", and there is nothing ever converting the escaped string to the converted string:
string encodedValue = "\\u" + ((int)c).ToString("x4");
If you want to convert the escaped string, then you need to do something like this:
How to convert a string containing escape characters to a string
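The linked answer is not reproduced here, so the following is only one possible approach, not necessarily the one it describes. It is a small sketch that turns literal \uXXXX sequences inside a runtime string into the characters they name. The names EscapeDecoder and DecodeUnicodeEscapes are hypothetical.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class EscapeDecoder
{
    // Replaces each literal \uXXXX sequence (backslash, 'u', four hex digits)
    // in a runtime string with the character it names.
    // Other backslash sequences, such as \t, are left untouched.
    public static string DecodeUnicodeEscapes(string value)
    {
        return Regex.Replace(
            value,
            @"\\u([0-9A-Fa-f]{4})",
            m => ((char)int.Parse(m.Groups[1].Value, NumberStyles.HexNumber)).ToString());
    }
}
```

Applied to the output of EscapeNonAsciiCharacters, this would undo the escaping, which again suggests the escaping step is not needed in the first place.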

Related

Detect Special Characters in a text in C#

In my program, I'm going to process some strings. These strings can be from any language (e.g. Japanese, Portuguese, Mandarin, English, etc.).
Sometimes these strings may contain HTML special characters such as the trademark symbol (™), the registered symbol (®), the copyright symbol (©), etc.
Then I am going to generate an Excel sheet with these details. But when there is a special character, even though the Excel file is created, it cannot be opened because it appears to be corrupted.
So what I did was encode each string before writing it into Excel. But what happened next is that all the strings except the English ones were encoded. The picture shows that the asset description, which is Japanese text, was also converted into encoded text. But I wanted to encode special characters only.
゜祌づ りゅ氧廩, 駤びょ菣 鏥こ埣槎で is converted to ゜祌づ りゅ氧廩, 駤びょ菣 鏥こ埣槎で But I wanted to encode only the special characters.
So what I need is to identify whether the string contains that kind of special character. Since I am dealing with multiple languages, is there any way to identify whether a string contains HTML special characters?
Try this using the Regex.IsMatch Method:
string str = "*!#©™®";
var regx = new Regex("[^a-zA-Z0-9_.]");
if (regx.IsMatch(str))
{
Console.WriteLine("Special character(s) detected.");
}
See the Demo
Try the Regex.Replace method:
// Replace letters and numbers with nothing then check if there are any characters left.
// The only characters will be something like $, #, ^, or $.
//
// [\p{L}\p{Nd}]+ checks for words/numbers in any language.
if (!string.IsNullOrWhiteSpace(Regex.Replace(input, @"([\p{L}\p{Nd}]+)", "")))
{
// Do whatever with the string.
}
Detection demo.
I suppose that you could start by treating your string as a Char array
https://msdn.microsoft.com/en-us/library/system.char(v=vs.110).aspx
Then you can examine each character in turn. Indeed on a second read of that manual page why not use this:
string s = "Sometime these strings may contain some HTML special characters like trademark symbol(™), registered symbol(®), Copyright symbol(©) and etc.゜祌づ りゅ氧廩, 駤びょ菣 鏥こ埣槎で";
Char[] ca = s.ToCharArray();
foreach (Char c in ca){
if (Char.IsSymbol(c))
Console.WriteLine("found symbol:{0} ",c );
}
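Building on the Char.IsSymbol idea, here is a rough sketch of encoding only the symbol characters and leaving letters from any language alone. The names SymbolEncoder and EncodeSymbols are invented for this example, and the numeric character reference format (&#NNNN;) is an assumption about what the encoding step should produce. Note that char.IsSymbol also reports ASCII characters such as + and $ as symbols, so the predicate may need narrowing for real data.

```csharp
using System;
using System.Text;

public static class SymbolEncoder
{
    // Rewrites characters that char.IsSymbol reports as symbols as numeric
    // character references (&#NNNN;) and leaves all other characters alone.
    public static string EncodeSymbols(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            if (char.IsSymbol(c))
                sb.Append("&#").Append((int)c).Append(';');
            else
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```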

Is there a way to differentiate a string argument between non-hexadecimal and a hexadecimal?

Let's say we have the following signature
void doSomething(string s)
When the user calls the function, they can call
doSomething("hello") or doSomething("\x15\x3C\xFF")
Is there a way to tell when the argument is the second form, a hexadecimal value?
I want to do something like
if(isHex(s))
// do this
else
// do that
No. This is not possible. To the runtime environment, a string is essentially just an array of characters (which is essentially just a collection of bytes). It has no idea how those characters were originally represented either in plain text or escaped sequences of hexadecimal.
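A tiny sketch of why this cannot work: the compiler resolves the escapes before the program runs, so both spellings yield the same characters. The class and method names below are hypothetical.

```csharp
using System;

public static class LiteralFormsDemo
{
    // Written with hexadecimal escapes; the compiler turns them into 'A' and 'B'.
    public static string FromHexEscapes() => "\x41\x42";

    // Written as plain text.
    public static string FromPlainText() => "AB";

    // Compares the two strings character by character.
    public static bool AreIndistinguishable() =>
        string.Equals(FromHexEscapes(), FromPlainText(), StringComparison.Ordinal);
}
```

AreIndistinguishable() returns true, so a method receiving either string has nothing left to inspect that would reveal how the literal was written.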
You can use a regex to check for valid hex strings. But in order to do this you must provide the string in hex notation as is, i.e. without C#'s interpretation and transformation into a normal string. Use a verbatim string (introduced by an "@") for this:
string s = @"\x15\x3C\xFF";
In verbatim strings, the backslashes are not interpreted as escape characters by C#. But the downside of this is that you are not getting the intended resulting string any more, of course.
public static bool IsHexString(string s)
{
    return Regex.IsMatch(s, @"^(\\x[0-9A-F]{2})+$");
}
Explanation of the regular expression:
^ beginning of string.
\\ escaped backslash ("\"). Not a C# escape here, but a regex escape.
x the letter "x".
[0-9A-F]{2} two consecutive hex digits.
(...)+ at least one occurrence of a hex number.
$ end of line.

Treat string input as literal

When I receive input via C# it comes in escaping the \. When I'm trying to parse the string it causes an error because it's using \\r instead of \r in the string. Is there some way to prevent it from escaping the \, or perhaps to turn \\ into \ in the string? I've tried:
protected string UnEscape(string s)
{
    if (s == "")
        return " ";
    return s.Replace(@"\\", @"\");
}
With no luck. So any other suggestions.
EDIT:
I was not specific enough as some of you seemed confused as to what I'm trying to achieve. In debug I was reading "\\t" in a string but I wanted "\t" not because I want to output \t but because I want to output a [tab]. With the code above I was sort of trying to recreate something that has already been done through Regex.Unescape(string).
The problem is that most .NET components do not process backslash escape sequences in strings: the compiler does it for them when the string is presented as a literal. However, there is another .NET component that processes escape sequences - the regex engine. You can use Regex.Unescape to do unescaping for you:
string escaped = @"Hello\thello\nWorld!";
string res = Regex.Unescape(escaped);
Console.WriteLine(res);
This prints
Hello hello
World!
Note that the example uses a verbatim string, so \t and \n are not replaced by the compiler. The string escaped is presented to the regex engine with single backslashes (although you would see double backslashes if you look at the string in the debugger).
The problem is not that it's escaping the backslash, it's that it's not parsing escape sequences into characters. Instead of getting the \r character when the characters \\ and r are entered, you get them as the two separate characters.
You can't turn @"\\" into @"\" in the string, because there aren't any double backslashes; that's only how the string is displayed when you look at it using debugging tools. It's actually a single backslash, and you can't turn that into the \ part of an escape sequence, because that's not a character by itself.
You need to replace any escape sequence in the input that you want to convert with the corresponding character:
s = s.Replace("\\r", "\r");
Edit:
To handle the special case that Servy is talking about, you replace all escape sequences at once. Example:
s = Regex.Replace(s, @"\\([\\rntb])", m => {
    switch (m.Groups[1].Value) {
        case "r": return "\r";
        case "n": return "\n";
        case "t": return "\t";
        case "b": return "\b";
        default: return "\\";
    }
});
If you have the three characters \, \, r in the input and you want to change this to the \r character then try
input = input.Replace(@"\\r", "\r");
If you have the two characters \, r in the input and you want to change this to the \r character then try
input = input.Replace(@"\r", "\r");

Removing hidden characters from within strings

My problem:
I have a .NET application that sends out newsletters via email. When the newsletters are viewed in Outlook, Outlook displays a question mark in place of a hidden character it can’t recognize. These hidden characters come from end users who copy and paste the HTML that makes up the newsletters into a form and submit it. C#'s Trim() removes these hidden characters if they occur at the beginning or end of the string. When the newsletter is viewed in Gmail, Gmail does a good job of ignoring them. When I paste these hidden characters into a Word document and turn on the “show paragraph marks and hidden symbols” option, the symbols appear as one rectangle inside a bigger rectangle. Also, the text that makes up the newsletters can be in any language, so accepting Unicode characters is a must. I've tried looping through the string to detect the character, but the loop doesn't recognize it and passes over it. Also, asking the end user to paste the HTML into Notepad first before submitting it is out of the question.
My question:
How can I detect and eliminate these hidden characters using C#?
You can remove all control characters from your input string with something like this:
string input; // this is your input string
string output = new string(input.Where(c => !char.IsControl(c)).ToArray());
Here is the documentation for the IsControl() method.
Or if you want to keep letters and digits only, you can also use the IsLetter and IsDigit function:
string output = new string(input.Where(c => char.IsLetter(c) || char.IsDigit(c)).ToArray());
I usually use this regular expression to replace all non-printable characters.
By the way, most of the people think that tab, line feed and carriage return are non-printable characters, but for me they are not.
So here is the expression:
string output = Regex.Replace(input, @"[^\u0009\u000A\u000D\u0020-\u007E]", "*");
[^...] matches any character that is not one of the following:
\u0009 is tab
\u000A is linefeed
\u000D is carriage return
\u0020-\u007E means everything from space to ~ -- that is, every printable ASCII character.
See ASCII table if you want to make changes. Remember it would strip off every non-ASCII character.
To test above you can create a string by yourself like this:
string input = string.Empty;
for (int i = 0; i < 255; i++)
{
input += (char)(i);
}
What best worked for me is:
string result = new string(value.Where(c => char.IsLetterOrDigit(c) || (c >= ' ' && c <= byte.MaxValue)).ToArray());
Where I'm making sure the character is any letter or digit, so that I don't ignore any non English letters, or if it is not a letter I check whether it's an ascii character that is greater or equal than Space to make sure I ignore some control characters, this ensures I don't ignore punctuation.
Some suggest using IsControl to check whether the character is non printable or not, but that ignores Left-To-Right mark for example.
new string(input.Where(c => !char.IsControl(c)).ToArray());
IsControl misses some invisible characters such as the left-to-right mark (LRM), a format character that commonly hides in a string after copy and paste. If you are sure that your string has only letters and digits, then you can use IsLetterOrDigit
new string(input.Where(c => char.IsLetterOrDigit(c)).ToArray())
If your string has special characters, then
new string(input.Where(c => c < 128).ToArray())
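Since the left-to-right mark belongs to the Unicode "Format" category rather than "Control", one option is to filter on both. The sketch below assumes that dropping every format character is acceptable for the input at hand; that is not always true, because characters such as the zero-width joiner are also in that category and matter in some scripts. It also removes tabs and line breaks, which are control characters. The names InvisibleCharFilter and RemoveControlAndFormat are made up for this example.

```csharp
using System;
using System.Globalization;
using System.Linq;

public static class InvisibleCharFilter
{
    // Drops control characters (category Cc) and format characters (category Cf,
    // which includes U+200E LEFT-TO-RIGHT MARK) and keeps everything else,
    // including non-ASCII letters and punctuation.
    public static string RemoveControlAndFormat(string input)
    {
        return new string(input.Where(c =>
            !char.IsControl(c) &&
            char.GetUnicodeCategory(c) != UnicodeCategory.Format).ToArray());
    }
}
```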
You can do this:
var hChars = new char[] {...};
var result = new string(yourString.Where(c => !hChars.Contains(c)).ToArray());
TLDR Answer
Use this Regex...
\P{Cc}\P{Cn}\P{Cs}
Like this...
var regex = new Regex(@"![\P{Cc}\P{Cn}\P{Cs}]");
TLDR Explanation
\P{Cc} : Do not match control characters.
\P{Cn} : Do not match unassigned characters.
\P{Cs} : Do not match UTF-8-invalid characters.
Working Demo
In this demo, I use this regex to search the string "Hello, World!" followed by an invisible character. That weird character at the end is (char)4, the END OF TRANSMISSION character.
using System;
using System.Text.RegularExpressions;
public class Test {
    public static void Main() {
        var regex = new Regex(@"![\P{Cc}\P{Cn}\P{Cs}]");
        var matches = regex.Matches("Hello, World!" + (char)4);
        Console.WriteLine("Results: " + matches.Count);
        foreach (Match match in matches) {
            Console.WriteLine("Result: " + match);
        }
    }
}
Full Working Demo at IDEOne.com
The output from the above code:
Results: 1
Result: !
Alternatives
\P{C} : Match only visible characters. Do not match any invisible characters.
\P{Cc} : Match only non-control characters. Do not match any control characters.
\P{Cc}\P{Cn} : Match only non-control characters that have been assigned. Do not match any control or unassigned characters.
\P{Cc}\P{Cn}\P{Cs} : Match only non-control characters that have been assigned and are UTF-8 valid. Do not match any control, unassigned, or UTF-8-invalid characters.
\P{Cc}\P{Cn}\P{Cs}\P{Cf} : Match only non-control, non-formatting characters that have been assigned and are UTF-8 valid. Do not match any control, unassigned, formatting, or UTF-8-invalid characters.
Source and Explanation
Take a look at the Unicode Character Properties available that can be used to test within a regex. You should be able to use these regexes in Microsoft .NET, JavaScript, Python, Java, PHP, Ruby, Perl, Golang, and even Adobe. Knowing Unicode character classes is very transferable knowledge, so I recommend using it!
string output = new string(input.Where(c => !char.IsControl(c)).ToArray());
This will surely solve the problem. I had a non-printable substitute character (ASCII 26) in a string which was causing my app to break, and this line of code removed it.
I used this quick and dirty oneliner to clean some input from LTR/RTL marks left over by the broken Windows 10 calculator app. It's probably a far cry from perfect but good enough for a quick fix:
string cleaned = new string(input.Where(c => !char.IsControl(c) && (char.IsLetterOrDigit(c) || char.IsPunctuation(c) || char.IsSeparator(c) || char.IsSymbol(c) || char.IsWhiteSpace(c))).ToArray());
I experienced an error with the AWS S3 SDK
"Target resource path[name -‎3.‎30.‎2022 -‎15‎.‎27.‎00.pdf] has bidirectional characters, which are not supportedby System.Uri and thus cannot be handled by the .NET SDK"
The filename in my instance contained Unicode Character 'LEFT-TO-RIGHT MARK' (U+200E) between the dots. These were not visible in html or in Notepad++. When the text was pasted into Visual Studio 2019 Editor, the unicode text was visible and I was able to solve the issue.
The problem was solved by replacing all control and other non-printable characters from the filename using the following script.
var input = Regex.Replace(s, @"\p{C}+", string.Empty);
Credit Source: https://stackoverflow.com/a/40568888/1165173

What does the @ prefix do on string literals in C#

I read some C# article to combine a path using Path.Combine(part1,part2).
It uses the following:
string part1 = @"c:\temp";
string part2 = @"assembly.txt";
May I know what is the use of @ in part1 and part2?
@ is not related to any method.
It means that you don't need to escape special characters in the string following the symbol:
@"c:\temp"
is equal to
"c:\\temp"
Such a string is called 'verbatim' or @-quoted. See MSDN.
As others have said, it's one way to avoid escaping special characters, and it is very useful for specifying file paths.
string s1 = @"C:\MyFolder\Blue.jpg";
Another use is when you have a large string and want it to be displayed across multiple lines rather than as one long line.
string s2 = @"This could be very large string something like a Select query
which you would want to be shown spanning across multiple lines
rather than scrolling to the right and see what it all reads up";
As stated in C# Language Specification 4.0:
2.4.4.5 String literals
C# supports two forms of string literals: regular string literals and verbatim string literals. A regular string literal consists of zero or more characters enclosed in double quotes, as in "hello", and may include both simple escape sequences (such as \t for the tab character), and hexadecimal and Unicode escape sequences. A verbatim string literal consists of an @ character followed by a double-quote character, zero or more characters, and a closing double-quote character. A simple example is @"hello". In a verbatim string literal, the characters between the delimiters are interpreted verbatim, the only exception being a quote-escape-sequence. In particular, simple escape sequences, and hexadecimal and Unicode escape sequences are not processed in verbatim string literals.
It denotes a verbatim string literal, and allows you to use certain characters that normally have special meaning, for example \, which is normally an escape character, and new lines. For this reason it's very useful when dealing with Windows paths.
Without using @, the first line of your example would have to be:
string part1 = "c:\\temp";
More information here.
With @ you don't have to escape special characters.
So you would have to write "c:\\temp" without @.
To be more precise, these are called 'verbatim' strings. You can read about them here:
http://msdn.microsoft.com/en-us/library/aa691090(v=vs.71).aspx
The @ just indicates a different way of specifying a string such that you do not have to escape characters with \. The only caveat is that double quotes need to be written as "" to represent a single ".
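To make the comparison concrete, here is a short sketch with hypothetical names (VerbatimDemo and its methods) showing that the verbatim and regular spellings compile to the same strings, including the doubled-quote rule.

```csharp
using System;

public static class VerbatimDemo
{
    // The same path written as a verbatim literal and as a regular literal.
    public static bool SamePath() => @"c:\temp" == "c:\\temp";

    // Inside a verbatim literal, "" stands for one double-quote character.
    public static string Quoted() => @"say ""hi""";

    // The regular-literal spelling of the same text.
    public static string QuotedRegular() => "say \"hi\"";
}
```

SamePath() returns true, and Quoted() and QuotedRegular() return equal strings.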
