C#: Remove language specific symbols from string - c#

For the sake of the example, let's assume I am parsing some text written in German. This means that it contains symbols like ü or Ö. The problem is that all German-specific symbols get rendered as an empty square. Please take a look at this image:
Image http://img8.imageshack.us/img8/7502/93341046.png
Since I do not know whether this symbol is ü or Ö I want to replace it with "." (dot). So the string from the image above, should become "Osnabr.ck". How do I do that?
Any help would be greatly appreciated!
Best Regards,
Kiril

You can use a regular expression to replace any characters that you don't want. Just put the characters that you want in a negative set:
str = Regex.Replace(str, "[^0-9A-Za-z _]", ".");
You should look into what encoding you are using to decode the text. It looks like you are not using the same encoding that was used to encode the text, as the characters don't show up correctly.
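A minimal sketch of both points in this answer, assuming the input bytes are ISO-8859-1 (the real encoding of the asker's text is not known). The class and method names here are hypothetical, chosen only for illustration.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class GermanTextDemo
{
    // Replaces every character outside the allow-list with a dot.
    public static string MaskUnwanted(string input)
    {
        return Regex.Replace(input, "[^0-9A-Za-z _]", ".");
    }

    // Decodes raw bytes as ISO-8859-1. This is an assumption for the example;
    // use whatever encoding the text was actually written with.
    public static string DecodeLatin1(byte[] raw)
    {
        return Encoding.GetEncoding("iso-8859-1").GetString(raw);
    }

    public static void Main()
    {
        // "Osnabrück" encoded as ISO-8859-1: ü is the single byte 0xFC.
        byte[] raw = { 0x4F, 0x73, 0x6E, 0x61, 0x62, 0x72, 0xFC, 0x63, 0x6B };

        string decoded = DecodeLatin1(raw);
        Console.WriteLine(decoded == "Osnabr\u00FCck"); // True
        Console.WriteLine(MaskUnwanted(decoded));       // Osnabr.ck
    }
}
```

If decoding with the right encoding works, the regular expression is only needed when you really do want to strip the non-ASCII characters.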

If you want to see the actual characters (and I notice you are displaying the value in the immediate window in visual studio), you need to use a font that can display the characters. The presence of the square means the font you are using does not contain glyphs that match those characters. You can change the font used in various parts of Visual Studio in the options dialog.
Some more detail in this question here.

There is a Replace method on the string class. It's easiest to replace a single character with something else. Strings are immutable, so assign the result back:
InnerText = InnerText.Replace("ü", ".");
You can change several characters at the same time by chaining Replace:
InnerText = InnerText.Replace("ü", "[ue]").Replace("Ö", "[Oe]");

Related

Alignment issue when string contains non-alphanumeric characters like Chinese or Japanese

I'm working on a small program written in C# that queries results from the database and shows them in text file format.
I have a problem when the result contains non-alphanumeric characters. Please take a look at the sample below.
Johnny $1000
Adam $1000
测测 $1000
You can see that Johnny and Adam line perfectly, but not the 测测 characters. I've seen this thread
C# string format for multiple languages
And it's actually possible to line it up using tabs instead of spaces, but how do I calculate the number of tabs needed, given that alphabetic and Chinese/Japanese characters have different widths? And I have to do it inside SQL, which means that I cannot use something like MeasureString.
Any help will be appreciated.
@warheat1990: I am afraid that @Jeroen is right with his comment.
If you are supposed to use SQL, you may try to choose a font that is predictable (for example, one where a Chinese character is twice the width of an English character) and calculate the width directly. Any Chinese/Japanese character would then occupy twice the space of an English character.

How to identify zero-width character?

Visual Studio 2015 found an unexpected character in my code (error CS1056)
How can I identify what the character is? It's a zero-width character so I can't see it. I'd like to know exactly what it is so I can work out where it comes from and how to fix it with a find-and-replace (I have many similar errors).
Here's an example. There's a zero-width character between x and y in the quote below:
x​y
It would be helpful just to tell me the name of the character in my example, but I'd also like to know generally how to identify characters myself.
I have a little bit of Javascript embedded within my explanation of Unicode which allows you to see the Unicode characters you copy/paste into a textbox. Your example looks like this:
Here you can see that the character is U+200B. Just searching for that will normally lead you to http://www.fileformat.info, in this case this page which can give you details of the character.
If you have the characters yourself within an application, Char.GetUnicodeCategory is your friend. (Oddly enough, there's no Char.GetUnicodeCategory(int) for non-BMP characters as far as I can see...)
According to similar question: Remove zero-width space characters from a JavaScript string
I'd hit ctrl+f (or ctrl+h) and turn on Regexp option, then search (or search-replace) for:
[\u200B-\u200D\uFEFF]
I've just tried your example and successfully replaced that zero-width space with "X" mark.
Just please note that this range covers only a few specific characters as explained in that post, not all invisible characters.
edit - thanks to this page I've found a better expression that seems nicely supported in the "find/replace" when Regexp option is turned on:
\p{Cf}
which seems to match invisible characters. It successfully hit the one in your example, though I'm not exactly sure whether it covers everything you'd need. It may be worth playing with the whole {C} class, or searching for whitespace|nonprintable plus a negative match for the {Z} class (or {Zs}).
Aha, use this website http://www.fileformat.info/info/unicode/char/search.htm?q=%E2%80%8B&preview=entity
Are you looking for Unicode character U+200B: ZERO WIDTH SPACE?
http://www.fileformat.info/info/unicode/char/200b/index.htm
You can ask the built-in Unicode table:
var category = char.GetUnicodeCategory(s[1]);
The specific character in your example is in the Format category and here is what MSDN has to say about it:
Format character that affects the layout of text or the operation of text processes, but is not normally rendered. Signified by the Unicode designation "Cf" (other, format). The value is 15.
To get the character code, simply extract it:
char c = s[1];
int codepoint = (int)c; // gives you 0x200B
The unicode codepoint 0x200b is known as "zero width space".
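The approach in the answers above can be sketched as a small helper that lists the code point and Unicode category of every character in a string, so invisible ones stand out. The class and method names are hypothetical, and the sketch assumes the input contains no unpaired surrogates (char.ConvertToUtf32 throws on those).

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class CharInspector
{
    // Returns one "U+XXXX Category" entry per code point in the string.
    public static List<string> Describe(string s)
    {
        var lines = new List<string>();
        int i = 0;
        while (i < s.Length)
        {
            // Handles surrogate pairs, so non-BMP characters are reported once.
            int codePoint = char.ConvertToUtf32(s, i);
            UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(s, i);
            lines.Add($"U+{codePoint:X4} {category}");
            i += char.IsSurrogatePair(s, i) ? 2 : 1;
        }
        return lines;
    }

    public static void Main()
    {
        // The example from the question: x, ZERO WIDTH SPACE, y.
        foreach (string line in Describe("x\u200By"))
        {
            Console.WriteLine(line);
        }
        // U+0078 LowercaseLetter
        // U+200B Format
        // U+0079 LowercaseLetter
    }
}
```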

Converting the name of a character (i.e. comma) to an actual character (i.e. ,)

I am currently attempting to get kerning information out of a font file. I have got kerning pairs so far using Font Forge. It gives you a list of kerning pairs like so:
pos \Y \s -184;
pos \Y \A -154;
pos \o \period -49;
pos \r \period -150;
pos \r \comma -170;
The problem now lies in the fact that instead of giving me an ASCII code or something similar, for punctuation and other non-alphabetic characters it just gives the name of the character, e.g. comma, period etc.
Can anyone think of a way of converting the name 'comma' to the actual comma ',' character? I'm a bit stumped.
Going from the human readable character name to the character value would require a mapping between the two representations. You could do this yourself or search for a library that does it for you. I was not able to find a c library, but something like this java library may do the trick:
http://sisc-scheme.org/manual/javadoc/sisc/reader/CharUtil.html
The other option is to take a closer look at Font Forge to see if it has options to output the information you want directly.
http://fontforge.org/
There is no built in way to do so. The closest class that could have provided such information is CharUnicodeInfo, but it does not map names to characters.
You'd have to build custom map like following:
var nameToChar = new Dictionary<string,char>()
{ {"period", '.'} };
var ch = nameToChar["period"];

Escape string from file

I have to parse some files that contain some string that has characters in them that I need to escape. To make a short example you can imagine something like this:
var stringFromFile = "This is \\n a test \\u0085";
Console.WriteLine(stringFromFile);
The above results in the output:
This is \n a test \u0085
But I want the escape sequences in the text interpreted (unescaped). How do I do this in C#? The text contains unicode characters too.
To make clear; The above code is just an example. The text contains the \n and unicode \u00xx characters from the file.
Example of the file contents:
Fisika (vanaf Grieks, \u03C6\u03C5\u03C3\u03B9\u03BA\u03CC\u03C2,
\"Natuurlik\", en \u03C6\u03CD\u03C3\u03B9\u03C2, \"Natuur\") is die
wetenskap van die Natuur
Try it using: Regex.Unescape(string)
Should be the right way.
Att.
Don't use the @ symbol -- this interprets the string as 100% literal. Just take it off and all shall be well.
EDIT
I may have been a bit hasty with my reply. I think what you're asking is: how can I have C# turn the literal string '\n' into a newline, when read from a file (similar question for other escaped literals).
The answer is: you write it yourself. You need to search for "\\n" and convert it to "\n". Keep in mind that in C#, it's the compiler not the language that changes your strings into actual literals, so there's not some library call to do this (actually there could be -- someone look this up, quick).
EDIT
Aha! Eureka! Behold:
http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.unescape.aspx
Since you are reading the string from a file, \n is not read as a unicode character but rather as two characters \ and n.
I would say you probably need a search-and-replace function to convert the string "\n" to its unicode character '\n' and so on.
I don't think there's any easy way to do this. Because it's the job of lexical analyzer to parse literals.
I would try generating and compiling a class via CodeDOM with the string inserted there as constant. It's not very fast but it will do all escaping.
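The Regex.Unescape suggestion from the answers above can be sketched as follows. The sample string is taken from the file excerpt in the question; the class and method names are hypothetical. Note that Regex.Unescape follows regular-expression escape rules rather than C# literal rules, so it may reject a backslash followed by a letter it does not recognize; treat this as a sketch for input like the sample, not a general C# literal parser.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnescapeDemo
{
    // Turns two-character sequences such as \n, \" and \uXXXX
    // (as read from a file) into the characters they denote.
    public static string Unescape(string raw)
    {
        return Regex.Unescape(raw);
    }

    public static void Main()
    {
        // Backslashes are doubled here only because this is a C# literal;
        // the string itself contains single backslashes, as the file would.
        string fromFile = "\\u03C6\\u03CD\\u03C3\\u03B9\\u03C2, \\\"Natuur\\\"\\n";
        Console.Write(Unescape(fromFile));
    }
}
```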

Problem with escape character

I have a string variable. And it contains the text:
\0#«Ия\0ьw7к\b\0E\0њI\0\0ЂЪ\n
When I try to add it to the TextBox control, nothing happens, because \0 means END.
How do I add text as it is?
UPDATE:
The text is placed in the variable dynamically. Thus, @ is not suitable.
Is the idea that you want to display the backslashes? If so, the backslashes will need to be in the original string.
If you're getting that text from a string literal, it's just a case of making it a verbatim string literal:
string text = @"\0#«Ия\0ьw7к\b\0E\0њI\0\0ЂЪ\n";
If you want to pass in a string which really contains the Unicode "nul" character (U+0000), then you won't be able to get Windows to display that. You should remove those characters first:
textBox.Text = value.Replace("\0", "");
"\\0#«Ия\\0ьw7к\\b\\0E\\0њI\\0\\0ЂЪ\\n"
or
@"\0#«Ия\0ьw7к\b\0E\0њI\0\0ЂЪ\n"
Well, I don't know where your text is coming from, but if you have to, you can use
using System.Text.RegularExpressions;
...
string escapedText = Regex.Escape(originalText);
However, if it's not soon enough, the string will already contain null characters.
And it contains the text:
\0#«Ия\0ьw7к\b\0E\0њI\0\0ЂЪ\n
No it doesn't. That's what the debugger told you it contains. The debugger automatically formatted the content as though you had written it as a literal value in your source code. The string doesn't actually contain the backslashes, they were added by the debugger formatter.
The string actually contains binary zeros. You can see this for yourself by using string.ToCharArray(). You cannot display this string as-is, you have to get rid of the zeros. Displaying the content in hex could work for example, BitConverter.ToString(byte[]) helps with that.
You can't.
Standard Windows controls cannot display null characters.
If you're trying to display the literal text \0, change the string to start with an @ sign, which tells the compiler not to parse escape sequences. (@"\0#«Ия\0ьw7к\b\0E\0њI\0\0ЂЪ\n")
If you want to display as much of the string as you can, you can strip the nulls, like this:
textBox.Text = someString.Replace("\0", "");
You can also replace them with escape codes:
textBox.Text = someString.Replace("\0", @"\0");
You might try escaping the backslash in \0, i.e. \\0. See this MSDN reference for a full list of C# escape sequences.
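The two options from the answers above (stripping the nulls, or showing them as a visible \0) can be sketched without a TextBox. The class and method names are hypothetical; the result of either method is what you would assign to textBox.Text.

```csharp
using System;

public static class NullDisplay
{
    // Removes U+0000 characters so the rest of the text can be displayed.
    public static string StripNulls(string value)
    {
        return value.Replace("\0", "");
    }

    // Replaces each U+0000 with the two visible characters '\' and '0'.
    public static string ShowNulls(string value)
    {
        return value.Replace("\0", @"\0");
    }

    public static void Main()
    {
        string value = "a\0b\0c"; // really contains two null characters
        Console.WriteLine(StripNulls(value)); // abc
        Console.WriteLine(ShowNulls(value));  // a\0b\0c
    }
}
```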
