I'm parsing a number of text files that contain 99.9% ASCII characters: numbers, basic punctuation, and letters A-Z (upper and lower case).
The files also contain names, which occasionally include characters from the extended ASCII set, for example the umlaut Ü and the cedilla ç.
I want to work only with standard ASCII, so I handle these extended characters by running any names through a series of simple Replace() calls:
myString = myString.Replace("ç", "c");
myString = myString.Replace("Ü", "U");
This works for all the strange characters I want to replace except Ø (capital O with a slash through it). I think it has a decimal equivalent of 157.
If I process the string character by character and call Convert.ToInt32() on each character, it claims the decimal equivalent is 65533, well outside the normal range of extended ASCII codes.
Questions
Why doesn't myString.Replace("Ø", "O"); work on this character?
How can I replace "Ø" with "O"?
Other information that may be pertinent: opening the file with Notepad shows the character as "Ø". Comparison with other sources indicates that the data is correct (i.e. the full string is "Jørgensen", a valid Danish name). Viewing the character in Visual Studio shows it as "�". I'm getting exactly the same problem (with this one character) in hundreds of different files, and I can happily replace all the other extended characters I encounter without problems. I'm using System.IO.File.ReadAllLines() to read all the lines into an array of strings for processing.
Replace works fine for the 'ø' when it 'knows' about it:
Console.WriteLine("Jørgensen".Replace("ø", "o"));
In your case, the problem is that you are reading the data with the wrong encoding, which is why the string does not contain the character you are trying to replace.
Ø is part of the extended ASCII set, ISO-8859-1, but File.ReadAllLines tries to detect the encoding from BOM characters and, I suspect, falls back to UTF-8 in your case (see the Remarks section in the documentation).
You see the same behavior in VS Code: it tries to open the file as UTF-8 and shows you �. If you switch to the correct encoding, it displays the text correctly.
If you know what encoding is used for your files, use it explicitly; here is an example to illustrate the difference:
// prints J?rgensen
File.ReadAllLines("data.txt")
    .Select(l => l.Replace("ø", "o"))
    .ToList()
    .ForEach(Console.WriteLine);

// prints Jorgensen
File.ReadAllLines("data.txt", Encoding.GetEncoding("iso-8859-1"))
    .Select(l => l.Replace("ø", "o"))
    .ToList()
    .ForEach(Console.WriteLine);
If you want to stick to chars from the default ASCII set, you may convert all special chars from the extended set to base ones (it will be ugly and non-trivial), or you can search online for ways to deal with this concern; you may find String.Normalize() or this thread with several other suggestions.
// Requires: using System.Globalization; using System.Text;
public static string RemoveDiacritics(string s)
{
    // Decompose each character into its base character plus combining marks
    // (e.g. "ç" becomes "c" + U+0327), then drop the combining marks.
    var normalizedString = s.Normalize(NormalizationForm.FormD);
    var stringBuilder = new StringBuilder();
    for (var i = 0; i < normalizedString.Length; i++)
    {
        var c = normalizedString[i];
        if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            stringBuilder.Append(c);
    }
    return stringBuilder.ToString();
}
...
// prints Jørgensen - note that ø survives; see the caveat below
File.ReadAllLines("data.txt", Encoding.GetEncoding("iso-8859-1"))
    .Select(RemoveDiacritics)
    .ToList()
    .ForEach(Console.WriteLine);
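One caveat worth flagging: 'ø' (U+00F8) and 'Ø' (U+00D8) have no canonical decomposition, so FormD normalization leaves them intact, and RemoveDiacritics alone won't turn "Jørgensen" into "Jorgensen". A minimal sketch of one way to mop those up, assuming a hand-maintained table (the entries below are illustrative, not exhaustive):

// Requires: using System.Collections.Generic; using System.Text;
static readonly Dictionary<char, string> NonDecomposable = new Dictionary<char, string>
{
    ['Ø'] = "O",  ['ø'] = "o",
    ['Æ'] = "AE", ['æ'] = "ae",
    ['Đ'] = "D",  ['đ'] = "d",
    ['ß'] = "ss",
};

public static string ReplaceNonDecomposable(string s)
{
    var sb = new StringBuilder(s.Length);
    foreach (var c in s)
        sb.Append(NonDecomposable.TryGetValue(c, out var repl) ? repl : c.ToString());
    return sb.ToString();
}

Run the input through RemoveDiacritics and then through this table, and "Jørgensen" comes out as "Jorgensen".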
I'd strongly recommend reading C# in Depth: Unicode by Jon Skeet and Programming with Unicode by Victor Stinner to get a much better understanding of what's going on :) Good luck.
PS. My code example is functional and compact, but pretty inefficient; if you parse huge files, consider another solution.
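For large files, a hedged alternative sketch: File.ReadLines streams lines lazily instead of loading the whole file into an array, so memory use stays flat (same illustrative file name and encoding as above):

// Requires: using System; using System.IO; using System.Text;
foreach (var line in File.ReadLines("data.txt", Encoding.GetEncoding("iso-8859-1")))
    Console.WriteLine(RemoveDiacritics(line));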
Related
Using the uGUI Text component, I'm getting "replacement characters", aka �, and I can't find a way to remove them.
I'm getting a string from the Instagram API which contains Unicode characters for non-Latin scripts (Japanese, for example), which I need.
However, the Unicode characters for the emojis come through as replacement characters, aka �.
I don't require the emojis, and they can be stripped out; however, I can't find a method to do this.
I'm unable to use TextMeshPro because I can't generate a font asset with all the Unicode characters needed to display the various languages (this could be user error, but when I try, the process hangs).
I notice these � characters don't appear in the Inspector or the console, so there must be a way to ignore or remove them.
I'm setting the string like this
body.text = System.Uri.UnescapeDataString(postData.text);
I've tried a number of things that haven't worked, including:
body.text = body.text.Replace('\uFFFD', '\''); // doesn't work
body.text = Regex.Replace(body.text, @"^[\ufffd]", string.Empty); // doesn't work
I've also tried breaking the string up as a char array. When I try to print to the console, I get this error when it hits a replacement character:
foreach (char item in postData.text.ToCharArray())
    print(item); // Error: UTF-16 to UTF-8 conversion failed because the input string is invalid
Any help with this would be greatly appreciated!
Thank you.
Unity 2018.4.4, C#
Found the answer!
This post provided a solution: How do I remove emoji characters from a string?
body.text = Regex.Replace(body.text, @"\p{Cs}", "");
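As a hedged explanation of why this works: \p{Cs} matches the Unicode "Surrogate" category, and the .NET regex engine applies it per UTF-16 code unit. Emoji live outside the Basic Multilingual Plane and are stored as surrogate pairs, so this removes them, along with any lone surrogates that render as �, while leaving BMP text such as Japanese untouched. A small self-contained demo with an illustrative string:

using System;
using System.Text.RegularExpressions;

class StripSurrogates
{
    static void Main()
    {
        // "\uD83D\uDE00" is the surrogate pair for 😀 (U+1F600).
        string input = "こんにちは \uD83D\uDE00 world";
        string cleaned = Regex.Replace(input, @"\p{Cs}", "");
        Console.WriteLine(cleaned); // prints: こんにちは  world
    }
}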
Issue identifying the form feed character in C# code when reading a file
string contents = File.ReadAllText(file);
I have attempted to decode the file with various encodings and then run a Replace() using the UTF-8 and UTF-32 hex values for the character.
In the watch window I see the '\f' character, but when I expand the visualizer I see the actual female character.
How do you identify which is the correct character to search for: the \f, or some variation of the female sign?
I have looked at this site for the encoding variants of the character, with no luck at actually finding it in C#: www.fileformat.info/info/unicode/char/2640/index.htm
Your question is a little vague on whether you are trying to find the character \f or the ♀ character.
If you are trying to find the ♀ character, you can use the hexadecimal code 0x2640, or simply use the character as-is:
var ctn = File.ReadAllText("file.txt", Encoding.UTF8);
int pos = ctn.IndexOf((char)0x2640);
int pos1 = ctn.IndexOf('♀');
Clarification: I think the confusion comes from the fact that ALT+12 and ALT+2640 often produce the same 'female sign' glyph, but this is for historical reasons: ALT+12 is, in ASCII, a device control code (form feed). Only the ALT+2640 Unicode character is specifically designed to always produce the ♀ sign.
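A hedged way to settle which code point the file actually contains (the file name is illustrative): dump each suspect character's numeric value instead of trusting how a debugger renders it.

using System;
using System.IO;
using System.Text;

class FindChar
{
    static void Main()
    {
        string ctn = File.ReadAllText("file.txt", Encoding.UTF8);
        for (int i = 0; i < ctn.Length; i++)
        {
            // U+000C is form feed ('\f'); U+2640 is the female sign ('♀').
            if (ctn[i] == '\f' || ctn[i] == '\u2640')
                Console.WriteLine($"index {i}: U+{(int)ctn[i]:X4}");
        }
    }
}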
So, I re-ran everything this morning with the following combination of UTF-8 encoding and searching on '\f':
string contents = File.ReadAllText(file, Encoding.UTF8);
int pos = contents.IndexOf("\f");
and finally got a hit.
I still don't know why the watch window and the visualizer display the character differently, but that combination of searching works.
Thanks everyone.
I'm writing a program that reads all the text in a file into a string, loops over that string looking at the characters, and then appends the characters back to another string using a StringBuilder. The issue I'm having is that when it's written back out, special characters such as “ and ” come out looking like ï¿½ instead. I don't need to do a conversion; I just want it written back out the way I read it in:
StringBuilder sb = new StringBuilder();
string text = File.ReadAllText(filePath);
for (int i = 0; i < text.Length; ++i) {
    if (text[i] != '{') { // looking for an opening curly brace
        sb.Append(text[i]);
        continue;
    }
    // Do stuff
}
File.WriteAllText(destinationFile, sb.ToString());
I tried using different encodings (UTF-8, UTF-16, ASCII), but then it just came out even worse; I started getting question mark symbols and Chinese characters (yes, a bit of a shotgun approach, but I was just experimenting).
I did read this article: http://www.joelonsoftware.com/articles/Unicode.html
...but it didn't really explain why I was seeing what I saw, unless in C# the reader starts cutting off bits when it hits weird characters like that. Thanks in advance for any help!
TL;DR: that is definitely not UTF-8, and you are not even using UTF-8 to read the resulting file. Read as Windows-1252, write as Windows-1252 (if you are going to use the same viewer to view the resulting file).
Well, let's first just say that there is no way a file made by a regular user will be in UTF-8. Not all programs in Windows even support it (Excel, Notepad...), let alone have it as the default encoding (even most developer tools don't default to UTF-8, which drives me insane). Since a lot of developers don't understand that such a thing as an encoding even exists, what chance do regular users have of saving their files in a UTF-8-hostile environment?
This is where your problems start. According to the documentation, the overload you are using, File.ReadAllText(filePath), can only detect UTF-8 or UTF-32.
Indeed, simply reading a file encoded normally in Windows-1252 that contains "a”a" results in the string "a�a", where � is the Unicode replacement character used to stand in for invalid bytes (read the Wikipedia section on it; it describes exactly the situation you are in!). When that replacement character is then encoded as UTF-8 and the result is viewed as Windows-1252, you see ï¿½, because the bytes for � in UTF-8 are 0xEF, 0xBF, 0xBD, which are the bytes for ï¿½ in Windows-1252.
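A sketch reproducing that round trip byte by byte (note: on .NET Core/.NET 5+, Windows-1252 additionally requires the System.Text.Encoding.CodePages package and a RegisterProvider call; on .NET Framework it is available out of the box):

using System;
using System.Text;

class Mojibake
{
    static void Main()
    {
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); // needed on .NET Core / .NET 5+
        Encoding w1252 = Encoding.GetEncoding("Windows-1252");

        byte[] original  = w1252.GetBytes("a”a");             // 0x61 0x94 0x61
        string misread   = Encoding.UTF8.GetString(original); // 0x94 is invalid UTF-8 -> "a\uFFFDa"
        byte[] rewritten = Encoding.UTF8.GetBytes(misread);   // U+FFFD -> 0xEF 0xBF 0xBD

        Console.WriteLine(misread);                    // a�a
        Console.WriteLine(w1252.GetString(rewritten)); // aï¿½a
    }
}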
So read it as Windows-1252 and you're half-way there:
Encoding windows1252 = Encoding.GetEncoding("Windows-1252");
String result = File.ReadAllText(@"C:\myfile.txt", windows1252);
Console.WriteLine(result); // Correctly prints "a”a" now
Because you saw ï¿½, the tool you are viewing the newly made file with is also using Windows-1252. So if the goal is to have the file show the correct characters in that tool, you must encode the output as Windows-1252:
Encoding windows1252 = Encoding.GetEncoding("Windows-1252");
File.WriteAllText(@"C:\myFile", sb.ToString(), windows1252);
Chances are the text will be UTF-8.
File.ReadAllText(filePath, Encoding.UTF8)
coupled with
File.WriteAllText(destinationFile, sb.ToString(), Encoding.UTF8)
should cover dealing with the Unicode characters. If you do one without the other, you're going to get garbage output; use both or neither.
I have a mixed Hebrew/English string to parse.
The string is built like this:
[3 hebrew] [2 english] [1 hebrew]
So it can be read as: 1 2 3, and it is stored as 3 2 1 (the exact byte sequence in the file, double-checked in a hex editor; in any case, RTL is only a display attribute). The .NET regex parser has an RTL option, which (when given plain LTR text) starts processing from the right side of the string.
I am wondering when this option should be applied: to extract the [3 hebrew] and [2 english] parts from the string, or to check whether [1 hebrew] matches the end of the string? Are there any hidden specifics, or is there nothing to worry about (just as when processing any LTR string with special Unicode characters)?
Also, can anyone recommend a good RTL+LTR text editor? (I'm afraid VS Express sometimes displays the text wrong, and in case it ever starts mangling the saved strings, I would like to re-check the files without resorting to hex editors.)
The RightToLeft option refers to the order in which the regular expression engine moves through the character sequence, and should really be called LastToFirst: in the case of Hebrew and Arabic it is actually left-to-right, and with mixed RTL and LTR text such as you describe, the expression "right to left" is even less appropriate.
This has a minor effect on speed (it will only matter if the searched text is massive) and on regular expressions run with a startAt index (searching earlier in the string than startAt rather than later).
Examples; let's hope the browsers don't mess this up too much:
string saying = "למכות is in כתר"; //Just because it amuses me that this is a saying whatever way round the browser puts malkuth and kether.
string kether = "כתר";
Console.WriteLine(new Regex(kether, RegexOptions.RightToLeft).IsMatch(saying));//True
Console.WriteLine(new Regex(kether, RegexOptions.None).IsMatch(saying));//True, perhaps minutely faster but so little that noise would hide it.
Console.WriteLine(new Regex(kether, RegexOptions.RightToLeft).IsMatch(saying, 2));//False
Console.WriteLine(new Regex(kether, RegexOptions.None).IsMatch(saying, 2));//True
//And to show that the ordering is codepoint rather than physical display ordering:
Console.WriteLine(new Regex("" + kether[0] + ".*" + kether[2]).IsMatch(saying));//True
Console.WriteLine(new Regex("" + kether[2] + ".*" + kether[0]).IsMatch(saying));//False
Simple yes or no question, and I'm 90% sure that it is no... but I'm not sure.
Can a Base64 string contain tabs?
It depends on what you're asking. If you are asking whether tabs can be Base64-encoded, then the answer is "yes", since they can be treated the same as any other ASCII character.
However, if you are asking whether Base64 output can contain tabs, then the answer is no. The following link is to an article detailing Base64, including which characters are considered valid:
http://en.wikipedia.org/wiki/Base64
The short answer is no, but note that Base64 output cannot contain carriage returns either.
That is why, if you have multiple lines of Base64, you strip out any carriage returns, line feeds, and anything else that is not in the Base64 alphabet.
That includes tabs.
From Wikipedia:
The current version of PEM (specified in RFC 1421) uses a 64-character alphabet consisting of upper- and lower-case Roman alphabet characters (A–Z, a–z), the numerals (0–9), and the "+" and "/" symbols. The "=" symbol is also used as a special suffix code. The original specification, RFC 989, additionally used the "*" symbol to delimit encoded but unencrypted data within the output stream.
As you can see, tab characters are not included. However, you can of course encode a tab character into a Base64 string.
Sure. Tab is just ASCII character 9, and that has a Base64 representation just like any other byte value.
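To make that concrete, a quick sketch: the tab byte (0x09) encodes like any other byte value.

using System;

class EncodeTab
{
    static void Main()
    {
        // 0x09 (tab) -> 6-bit groups 000010 and 01 + padding -> "CQ=="
        Console.WriteLine(Convert.ToBase64String(new byte[] { 9 })); // prints CQ==
    }
}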
The Base64 specification (RFC 4648) states in Section 3.3 that any encountered non-alphabet characters must be rejected unless explicitly allowed by another specification:
Implementations MUST reject the encoded data if it contains characters outside the base alphabet when interpreting base-encoded data, unless the specification referring to this document explicitly states otherwise. Such specifications may instead state, as MIME does, that characters outside the base encoding alphabet should simply be ignored when interpreting data ("be liberal in what you accept"). Note that this means that any adjacent carriage return/line feed (CRLF) characters constitute "non-alphabet characters" and are ignored.
Specs such as PEM (RFC 1421) and MIME (RFC 2045) specify that Base64 strings can be broken up by whitespace. Per the referenced RFC 822, a tab (HTAB) is considered a whitespace character.
So, when Base64 is used in the context of either MIME or PEM (and probably other similar specifications), it can contain whitespace, including tabs, which should be handled (stripped out) while decoding the encoded content.
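A hedged sketch of that handling: strip all whitespace (tabs included) before handing the text to a strict decoder. The wrapped string below is illustrative.

using System;
using System.Text;
using System.Text.RegularExpressions;

class DecodeWrapped
{
    static void Main()
    {
        // MIME/PEM-style Base64 may be broken up by CR, LF, tabs, and spaces.
        string wrapped = "VGhl\tIHJhaW4g\r\naW4gU3BhaW4=";
        string cleaned = Regex.Replace(wrapped, @"\s+", "");
        byte[] data = Convert.FromBase64String(cleaned);
        Console.WriteLine(Encoding.ASCII.GetString(data)); // prints: The rain in Spain
    }
}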
Haha, as you can see from the responses, this is actually not such a simple yes/no question.
A resulting Base64 string, after conversion, cannot contain a tab character, but it seems to me that you are not asking that; it seems you are asking whether you can represent a string containing a tab (before conversion) in Base64, and the answer to that is yes.
I would add, though, that you should really take care to preserve the encoding of your string, i.e. convert it to an array of bytes with the correct encoding (Unicode, UTF-8, whatever), then convert that array of bytes to Base64.
EDIT: A simple test.
private void button2_Click(object sender, EventArgs e)
{
    StringBuilder sb = new StringBuilder();
    string test = "The rain in spain falls \t mainly on the plain";
    sb.AppendLine(test);

    UTF8Encoding enc = new UTF8Encoding();
    byte[] b = enc.GetBytes(test);
    string cvtd = Convert.ToBase64String(b);
    sb.AppendLine(cvtd);

    byte[] c = Convert.FromBase64String(cvtd);
    string backAgain = enc.GetString(c);
    sb.AppendLine(backAgain);

    MessageBox.Show(sb.ToString());
}
It seems that there is a lot of confusion here, and surprisingly most answers are of the "no" variety. I don't think that is a good canonical answer.
The reason for the confusion is probably the fact that Base64 is not strictly specified; multiple practical implementations and interpretations exist.
You can check out link text for more discussion on this.
In general, however, conforming Base64 codecs SHOULD understand line feeds, as they are mandated by some Base64 definitions (76-character segments, then a line feed, and so on).
Because of this, most decoders also allow indentation whitespace, and quite commonly any whitespace between 4-character "triplets" (so named because they encode 3 bytes).
So there's a good chance that in practice you can use tabs and other whitespace.
But I would not add tabs myself when generating Base64 content sent to a service: be conservative in what you send, (more) liberal in what you accept.
Convert.FromBase64String() in the .NET Framework does not seem to mind them. I believe all whitespace in the string is ignored.
string xxx = "ABCD\tDEFG"; // simulated Base64-encoded string with an added tab
Console.WriteLine(xxx);
byte[] xx = Convert.FromBase64String(xxx); // convert string back to binary
Console.WriteLine(BitConverter.ToString(xx));
output:
ABCD DEFG
00-10-83-0C-41-46
The relevant clause is RFC 2045, section 6.8:
The encoded output stream must be represented in lines of no more than 76 characters each. All line breaks or other characters not found in Table 1 must be ignored by decoding software. In base64 data, characters other than those in Table 1, line breaks, and other white space probably indicate a transmission error, about which a warning message or even a message rejection might be appropriate under some circumstances.
YES!
Base64 is used to encode ANY 8-bit value (decimal 0 to 255) into a string using a set of safe characters. TAB is decimal 9.
Base64 uses one of the following character sets:
Data: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
URLs: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_
Binary attachments (e.g. in email) are also encoded in text using this system.
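A small sketch contrasting the two alphabets; Convert.ToBase64String emits only the standard one, so the URL-safe form is produced here by a simple substitution (a common, if ad hoc, approach):

using System;

class Alphabets
{
    static void Main()
    {
        // 0xFB 0xEF 0xBE is the 6-bit group 111110 (index 62) repeated four times,
        // so every output character is the alphabet's 62nd symbol.
        byte[] data = { 0xFB, 0xEF, 0xBE };
        string standard = Convert.ToBase64String(data);                  // "++++"
        string urlSafe  = standard.Replace('+', '-').Replace('/', '_');  // "----"
        Console.WriteLine(standard + " " + urlSafe);
    }
}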