Using the uGUI Text component, I'm getting "replacement characters" aka � and I can't find a way to remove them.
I'm getting a string from the Instagram API that contains Unicode characters for non-Latin scripts (Japanese, for example), which I need.
However, the Unicode characters for emojis come in as replacement characters (�).
I don't need the emojis, so they can be stripped out, but I can't find a method to do this.
I can't use TextMeshPro because I'm unable to generate a font asset with all the Unicode characters needed to display the various languages (this could be user error, but the process hangs when I try).
I notice these � characters don't appear in the Inspector or console so there must be a way to ignore or remove them.
I'm setting the string like this
body.text = System.Uri.UnescapeDataString(postData.text);
I've tried a number of things that haven't worked including
body.text = body.text.Replace('\uFFFD','\'');//doesn't work
body.text = Regex.Replace(body.text, @"^[\ufffd]", string.Empty);//doesn't work
I've also tried breaking up the string as a char array. When I try to print to console I get this error when it hits a replacement character:
foreach (char item in postData.text.ToCharArray())
print(item); //Error: UTF-16 to UTF-8 conversion failed because the input string is invalid
Any help with this would be greatly appreciated!
Thank you.
Unity 2018.4.4, c#
Found the answer!
This post provided a solution: How do I remove emoji characters from a string?
body.text = Regex.Replace(body.text, @"\p{Cs}", "");
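Why this works, as a minimal sketch: .NET strings are UTF-16, so every code point above U+FFFF (which includes most emoji) is stored as a pair of surrogate code units, and `\p{Cs}` matches exactly those units. The class and method names below are hypothetical, chosen only for illustration.

```csharp
using System.Text.RegularExpressions;

public static class SurrogateStripper
{
    // \p{Cs} is the Unicode "surrogate" category (U+D800..U+DFFF).
    // In a UTF-16 string these code units appear in pairs for any
    // code point above U+FFFF, or alone if the data is malformed.
    private static readonly Regex Surrogates = new Regex(@"\p{Cs}");

    // Hypothetical helper: removes paired and lone surrogates alike.
    public static string Strip(string input)
    {
        return Surrogates.Replace(input, string.Empty);
    }
}
```

Calling `SurrogateStripper.Strip(body.text)` leaves Japanese and other BMP text untouched. Two caveats: it also removes non-emoji characters outside the BMP (rare CJK ideographs, for example), and emoji that live inside the BMP, such as U+263A, are not removed.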
I'm parsing a number of text files that contain 99.9% ascii characters. Numbers, basic punctuation and letters A-Z (upper and lower case).
The files also contain names, which occasionally contain characters which are part of the extended ascii character set, for example umlauts Ü and cedillas ç.
I want to only work with standard ascii, so I handle these extended characters by processing any names through a series of simple replace() commands...
myString = myString.Replace("ç", "c");
myString = myString.Replace("Ü", "U");
This works with all the strange characters I want to replace except for Ø (capital O with a forward slash through it). I think this has the decimal equivalent of 157.
If I process the string character-by-character using ToInt32() on each character it claims the decimal equivalent is 65533 - well outside the normal range of extended ascii codes.
Questions
why doesn't myString.Replace("Ø", "O"); work on this character?
How can I replace "Ø" with "O"?
Other information - may be pertinent. Opening the file with Notepad shows the character as a "Ø". Comparison with other sources indicate that the data is correct (i.e. the full string is "Jørgensen" - a valid Danish name). Viewing the character in visual studio shows it as "�". I'm getting exactly the same problem (with this one character) in hundreds of different files. I can happily replace all the other extended characters I encounter without problems. I'm using System.IO.File.ReadAllLines() to read all the lines into an array of strings for processing.
Replace works fine for the 'Ø' when it 'knows' about it:
Console.WriteLine("Jørgensen".Replace("ø", "o"));
In your case the problem is that you are trying to read the data with the wrong encoding, that's why the string does not contain the character which you are trying to replace.
Ø is part of the extended ASCII set - iso-8859-1, but File.ReadAllLines tries to detect encoding using BOM chars and, I suspect, falls back to UTF-8 in your case (see Remarks in the documentation).
You see the same behavior in VS Code: it opens the file with UTF-8 encoding and shows you �.
If you switch the editor to the correct encoding, it shows the text correctly.
If you know what encoding is used for your files, just use it explicitly, here is an example to illustrate the difference:
// prints J?rgensen (read as UTF-8, the ø byte is lost before Replace runs)
File.ReadAllLines("data.txt")
    .Select(l => l.Replace("ø", "o"))
    .ToList()
    .ForEach(Console.WriteLine);
// prints Jorgensen
File.ReadAllLines("data.txt", Encoding.GetEncoding("iso-8859-1"))
    .Select(l => l.Replace("ø", "o"))
    .ToList()
    .ForEach(Console.WriteLine);
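To see where the 65533 from the question comes from, here is a small sketch that decodes the same bytes twice. It assumes the file stores ø as the single ISO-8859-1 byte 0xF8; the class and member names are made up for the example.

```csharp
using System.Text;

public static class EncodingDemo
{
    // "Jørgensen" as it would be stored in an ISO-8859-1 file:
    // ø is the single byte 0xF8, which is not valid UTF-8 on its own.
    public static readonly byte[] Latin1Bytes =
        { 0x4A, 0xF8, 0x72, 0x67, 0x65, 0x6E, 0x73, 0x65, 0x6E };

    // Wrong decoder: the invalid byte is replaced with U+FFFD (decimal 65533).
    public static string DecodeAsUtf8()
    {
        return Encoding.UTF8.GetString(Latin1Bytes);
    }

    // Right decoder: every byte maps directly to the matching code point.
    public static string DecodeAsLatin1()
    {
        return Encoding.GetEncoding("iso-8859-1").GetString(Latin1Bytes);
    }
}
```

With the UTF-8 decoder the second character is U+FFFD, so there is no ø left in the string for `Replace` to find. With the ISO-8859-1 decoder the string is "Jørgensen" and `Replace("ø", "o")` behaves as expected.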
If you want to use chars from the default ASCII set, you may convert all special chars from the extended set to the base one (it will be ugly and non-trivial). Or you can search online how to deal with your concern, and you may find String.Normalize() or this thread with several other suggestions.
public static string RemoveDiacritics(string s)
{
    var normalizedString = s.Normalize(NormalizationForm.FormD);
    var stringBuilder = new StringBuilder();
    for (var i = 0; i < normalizedString.Length; i++)
    {
        var c = normalizedString[i];
        if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            stringBuilder.Append(c);
    }
    return stringBuilder.ToString();
}
...
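// ø (U+00F8) has no canonical decomposition, so FormD leaves it unchanged;
// replace it explicitly after the combining marks have been stripped.
// prints Jorgensen
File.ReadAllLines("data.txt", Encoding.GetEncoding("iso-8859-1"))
    .Select(l => RemoveDiacritics(l).Replace("ø", "o"))
    .ToList()
    .ForEach(Console.WriteLine);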
I'd strongly recommend reading C# in Depth: Unicode by Jon Skeet and Programming with Unicode by Victor Stinner books to have a much better understanding of what's going on :) Good luck.
PS. My code example is functional and compact but pretty inefficient; if you parse huge files, consider using another solution.
I am writing an Android app in Unity using C#. The app will send SMS text messages that include a mixture of text and emojis.
My initial thought is to send the Unicode values of the respective emojis inline with any plain text. I have searched StackOverflow and I haven't found a concise example that solves this problem.
Here is code I have tried:
string mobile_num = "+18007671111"; //Placeholder
string text = "Test: \\uFFFd\\uFFFd"; //(smile emoji Unicode value)
char[] chars = text.ToCharArray();
byte[] bytes = Encoding.UTF8.GetBytes(chars);
string message = HttpUtility.UrlEncode(bytes);
string sms = string.Format("sms:{0}?body={1}", mobile_num, message);
Application.OpenURL(sms);
I need to know:
1. Is this the correct approach?
a. if not, please help me correctly encode text + emoji data
b. What step is required to convert the text so that the final message can be sent via SMS?
So after much searching, I found the simplest way in C# is to use:
\U########
Where:
\ is an escape character
U indicates that a Unicode code point follows
######## is the hex value of the emoticon's code point, written with exactly 8 digits and left-padded with zeros if necessary.
For example:
string u = "Smile: \U0001F601";
Will send:
Smile: 😁
Thank you Jeppe Stig Nielsen for your insight. For the full discussion follow this link:
How to convert numbers between hexadecimal and decimal
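As a rough sketch of how the `\U` escape fits into the original `sms:` link, the helper below percent-encodes the body with `Uri.EscapeDataString`, which encodes the text as UTF-8 before escaping. `SmsLink.Build` is a hypothetical name, and whether a given SMS app honors the `body` parameter or renders the emoji is device-dependent and not something this code can guarantee.

```csharp
using System;

public static class SmsLink
{
    // Hypothetical helper: builds an sms: URI with a percent-encoded body.
    // Uri.EscapeDataString converts the text to UTF-8 and escapes every
    // byte that is not an unreserved URI character.
    public static string Build(string number, string body)
    {
        return "sms:" + number + "?body=" + Uri.EscapeDataString(body);
    }
}
```

For example, `SmsLink.Build("+18007671111", "Smile: \U0001F601")` yields `sms:+18007671111?body=Smile%3A%20%F0%9F%98%81`, which can then be passed to `Application.OpenURL`. Note that `"\U0001F601"` has a `Length` of 2 because it is stored as the surrogate pair `\uD83D\uDE01`.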
Issue identifying the form feed character in c# code when reading a file
string contents = File.ReadAllText(file);
I have attempted to encode in various formats and then run a replace using UTF-8 hex, UTF-32 hex values for the character.
In the watch window I see
'\f' character
but when I expand the visualizer I see the actual female sign character (♀).
How do you identify which is the correct character to search for: \f, or some variation of the female sign?
I have looked at this site for the variations of encoding values with no luck at actually finding it in c#: www.fileformat.info/info/unicode/char/2640/index.htm
Your question is a little vague on whether you are trying to find the character \f or the ♀ character.
If you are trying to find the ♀ character, you can use the hexadecimal code 0x2640, or simply use the character as-is:
var ctn = File.ReadAllText("file.txt", Encoding.UTF8);
int pos = ctn.IndexOf((char)0x2640);
int pos1 = ctn.IndexOf('♀');
Clarification: I think the confusion might come from the fact that ALT+12 and ALT+2640 often produce the same 'Female Sign' glyph. This is for historical reasons: in ASCII, character 12 is a control code (form feed), which some legacy code pages draw as ♀. Only the Unicode character U+2640 is specifically designed to always produce the ♀ sign.
So, I re-ran everything this morning with the following combination of UTF8 encoding and searching on '\f'
string contents = File.ReadAllText(file, Encoding.UTF8);
int pos = contents.IndexOf("\f");
and finally got a hit.
I still don't know why the watch and visualizer display the character differently, but that combination of searching works.
Thanks everyone.
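A likely explanation for the mismatch is that the visualizer draws code 12 with a legacy code page glyph, while the string itself only ever contains U+000C. The sketch below (hypothetical names, sample text invented for the example) shows that the two characters are distinct and that searching for one never finds the other.

```csharp
public static class FormFeedDemo
{
    public const char FormFeed = '\f';        // U+000C, the control character
    public const char FemaleSign = '\u2640';  // U+2640, the actual symbol

    // IndexOf(char) compares code units ordinally, so glyph
    // similarity in a debugger or font plays no role here.
    public static int FindFormFeed(string contents)
    {
        return contents.IndexOf(FormFeed);
    }

    public static int FindFemaleSign(string contents)
    {
        return contents.IndexOf(FemaleSign);
    }
}
```

For the sample text `"page one\fpage two"`, `FindFormFeed` returns 8 and `FindFemaleSign` returns -1. Checking `(int)contents[pos]` in the debugger is a quick way to confirm which code point a file really contains.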
I'm new to programming and self taught. I'm trying to output the astrological symbol for Taurus, which is supposed to be U+2649 in Unicode. Here is the code I'm using...
string myString = "\u2649";
byte[] unicode = System.Text.Encoding.Unicode.GetBytes(myString);
Console.WriteLine(unicode.Length);
The result I'm getting is the number 2 instead of the symbol or font. I'm sure I'm doing something wrong.
Why are you converting it to bytes? That will not print the symbol. Drop the conversion and do the following:
string a = "\u2649";
Console.Write(a);
You need to have a font which displays that glyph. If you do, then:
Console.WriteLine(myString);
is all you need.
EDIT: Note, the only font I could find which has this glyph is "MS Reference Sans Serif".
The UTF-16 encoding of this character (which is what Encoding.Unicode produces) is 2 bytes long, and you are writing that Length to the Console.
Console.WriteLine(unicode.Length);
If you want to display the actual character, then you want:
Console.WriteLine(myString);
You must be using a font that has that Unicode range for it to display properly.
UPDATE:
With the default console font, Console.WriteLine(myString) will output a ? character because the font has no glyph for \u2649. As far as I have been able to find, there is no easy way to make the console display Unicode characters that are not already part of the system code pages or the font you choose for the console.
It may be possible to change the font used by the console: Changing Console Fonts
You are outputting the length of the character in bytes. With its default encoding and font the console typically can't render this character, however, so printing the string itself will come out as a '?' character.
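To separate the two issues in this thread (byte length versus display), here is a small sketch with hypothetical names. Setting `Console.OutputEncoding` to UTF-8 is a commonly suggested step, but it is only an assumption that it helps on a given machine: the console font still needs a glyph for U+2649.

```csharp
using System;
using System.Text;

public static class TaurusDemo
{
    public const string Taurus = "\u2649"; // one UTF-16 code unit

    // Byte counts depend on the encoding, not on the string itself.
    public static int Utf16ByteCount()
    {
        return Encoding.Unicode.GetByteCount(Taurus);
    }

    public static int Utf8ByteCount()
    {
        return Encoding.UTF8.GetByteCount(Taurus);
    }

    // Writes the symbol itself; what appears depends on the console font.
    public static void Print()
    {
        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine(Taurus);
    }
}
```

`Taurus.Length` is 1, `Utf16ByteCount()` is 2 (the number the original code printed), and `Utf8ByteCount()` is 3.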
We store a bunch of weird document names on our web server (people upload them) that have various characters like spaces, ampersands, etc. When we generate links to these documents, we need to escape them so the server can look up the file by its raw name in the database. However, none of the built in .NET escape functions will work correctly in all cases.
Take the document Hello#There.docx:
UrlEncode will handle this correctly:
HttpUtility.UrlEncode("Hello#There");
"Hello%23There"
However, UrlEncode will not handle Hello There.docx correctly:
HttpUtility.UrlEncode("Hello There.docx");
"Hello+There.docx"
The + symbol is only valid for URL parameters, not document names. Interestingly enough, this actually works on the Visual Studio test web server but not on IIS.
The UrlPathEncode function works fine for spaces:
HttpUtility.UrlPathEncode("Hello There.docx");
"Hello%20There.docx"
However, it will not escape other characters such as the # character:
HttpUtility.UrlPathEncode("Hello#There.docx");
"Hello#There.docx"
This link is invalid as the # is interpreted as a URL hash and never even gets to the server.
Is there a .NET utility method to escape all non-alphanumeric characters in a document name, or would I have to write my own?
Have a look at the Uri.EscapeDataString Method:
Uri.EscapeDataString("Hello There.docx") // "Hello%20There.docx"
Uri.EscapeDataString("Hello#There.docx") // "Hello%23There.docx"
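A minimal sketch of building a document link this way is shown below. The helper name and the base URL are hypothetical; the base URL is assumed to already end with a slash.

```csharp
using System;

public static class DocumentLinks
{
    // Hypothetical helper: appends one escaped document name to a base URL.
    // EscapeDataString escapes everything except unreserved characters,
    // including '/', so it should be applied to a single path segment only.
    public static string Build(string baseUrl, string documentName)
    {
        return baseUrl + Uri.EscapeDataString(documentName);
    }
}
```

`DocumentLinks.Build("https://example.com/docs/", "Hello#There.docx")` gives `https://example.com/docs/Hello%23There.docx`, and `Uri.UnescapeDataString` recovers the raw name on the server side for the database lookup.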
I would approach it a different way: Do not use the document name as key in your look-up - use a Guid or some other id parameter that you can map to the document name on disk in your database. Not only would that guarantee uniqueness but you also would not have this problem of escaping in the first place.
You can use the @ character to write verbatim strings, in which backslashes are not treated as escape sequences. See the pieces of code below.
string str = @"\n\n\n\n";
Console.WriteLine(str);
Output: \n\n\n\n
string str1 = @"\df\%%^\^\)\t%%";
Console.WriteLine(str1);
Output: \df\%%^\^\)\t%%
This kind of formatting is very useful for pathnames and for creating regexes.
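For comparison, here is a short sketch of what a verbatim (`@`) literal changes and what it does not. It only affects how the C# compiler reads the literal; it does not percent-encode anything, so it is not a substitute for URL escaping. The constant names are invented for the example.

```csharp
public static class VerbatimDemo
{
    public const string Regular = "\n\n";           // two newline characters
    public const string Verbatim = @"\n\n";         // four characters: \ n \ n
    public const string Path = @"C:\temp\new.txt";  // no doubled backslashes needed
    public const string Quoted = @"say ""hi""";     // a quote is written as ""
}
```

`Regular.Length` is 2 while `Verbatim.Length` is 4, and `Path` is the same string as the regular literal `"C:\\temp\\new.txt"`.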