I am having some issues with the default string encoding in C#. I need to read strings from certain files/packets. However, these strings include characters in the 128-255 range (extended ASCII), and all of these characters show up as question marks instead of the proper character. For example, when reading a string, it could come out as "S?meStr?n?" if the string contained extended ASCII characters.
Now, is there any way to change the default encoding for my application? I know that in Java you can define the default character set from the command line.
There's no one single "extended ASCII" encoding. There are lots of different 8-bit encodings which are compatible with ASCII for the bottom 128 values.
You need to find out what encoding your files actually use, and specify that when reading the data with StreamReader (or whatever else you're using). For example, you may want Windows-1252:
Encoding encoding = Encoding.GetEncoding(1252);
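For example, a minimal sketch of reading a file line by line with that encoding (the file path is just a placeholder):

using System;
using System.IO;
using System.Text;

// Assumes the file really is Windows-1252; adjust the code page if it isn't.
Encoding encoding = Encoding.GetEncoding(1252);
using (var reader = new StreamReader("data.txt", encoding))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        Console.WriteLine(line); // bytes 128-255 now decode to the intended characters
    }
}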
.NET strings are always sequences of UTF-16 code points. You can't change that, and you shouldn't try. (That's true in Java as well, and you really shouldn't use the platform default encoding when calling getBytes() etc unless that's what you really, really mean.)
An Encoding can be specified in at least one overload of functions for reading text - for example, ReadAllText(string, Encoding).
So if you know a file is encoded using Windows-1252, you can specify it like so:
string contents = File.ReadAllText(someFilePath, Encoding.GetEncoding(1252));
Of course, doing this requires knowing ahead of time which code page is being used.
Below is what the text looks like when viewed in Notepad++.
I need to get the IndexOf for that piece of the string, to use in the code below. And I can't figure out how to use the odd characters in my code.
int start = text.IndexOf("AppxxxxxxDB INFO");
Where the "xxxxx"'s represent the strange characters.
All these characters have corresponding ASCII codes; you can insert them in a string by escaping them.
For instance:
"App\x0000\x0001\x0000\x0003\x0000\x0000\x0000DB INFO"
or shorter:
"App\x00\x01\x00\x03\x00\x00\x00"+"DB INFO"
\xXXXX specifies a single character, where XXXX is the hexadecimal value corresponding to that character.
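Note that \x actually accepts one to four hex digits and is greedy, which is why the shorter form above splits the literal into two pieces: written as one literal, "\x00D" would swallow the 'D' of "DB INFO" as a hex digit. Putting it together with the IndexOf call from the question (a sketch; text is assumed to hold the decoded file contents):

int start = text.IndexOf("App\x00\x01\x00\x03\x00\x00\x00" + "DB INFO");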
Notepad++ simply tries to make these characters a bit more visible by rendering their abbreviations in a "bubble". But that's just rendering.
The origin of these characters is printer (and other media) directives. For instance, they were used to instruct a printer to move to the next line or to stop the print job, and they are still used today. Some terminals use them to communicate color changes, etc. The most well known is \n, or \x000A, which starts a new line. For text they are thus characters that specify how the text should be handled, a bit like modern HTML (although it's only a limited equivalence). \n is thus only a new line because there is a consensus about that. If one defines his/her own encoding, he can invent a new system.
Echoing @JonSkeet's warning, when you read a file into a string, the file's bytes are decoded according to a character set encoding. The decoder has to do something with byte values or sequences that are invalid per the encoding rules. Typical decoders substitute a replacement character and attempt to go on.
I call that data corruption. In most cases, I'd rather have the decoder throw an exception.
You can use a standard decoder, customize one or create a new one with the Encoding class to get the behavior you want. Or, you can preserve the original bytes by reading the file as bytes instead of as text.
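For example, a minimal sketch of the throw-instead-of-substitute option (assuming the file is expected to be UTF-8; the path is a placeholder):

using System;
using System.IO;
using System.Text;

// A UTF-8 decoder that throws instead of silently substituting U+FFFD.
var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                                  throwOnInvalidBytes: true);
try
{
    string contents = File.ReadAllText("input.dat", strictUtf8);
}
catch (DecoderFallbackException ex)
{
    // The bytes were not valid UTF-8; better to know than to silently corrupt the data.
    Console.WriteLine($"Invalid bytes at index {ex.Index}: {BitConverter.ToString(ex.BytesUnknown)}");
}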
If you insist on reading the file as text, I suggest using the 437 encoding because it has 256 characters, one for every byte value, no restrictions on byte sequences and each 437 character is also in Unicode. The bytes that represent text will possibly decode the same characters that you want to search for as strings, but you have to check, comparing 437 and Unicode in this table.
Really, you should have and follow the specification for the file type you are reading. After all, there is no text but encoded text, and you have to know which encoding it is.
I'm calling File.ReadAllText() in a program designed to format some files that I have.
Some of these files contain the ® (174) symbol. However, when the text is being read, the returned string contains � (65533) symbols where the ® (174) should be.
What would cause this and how can I fix it?
Most likely the file is in a different encoding than the default. If you know it, you can specify it using the File.ReadAllText Method (String, Encoding) overload.
Code sample:
string readText = File.ReadAllText(path, Encoding.Default); // <-- change the encoding to whatever the encoding really is
If you DON'T know the encoding, see this previous SO question: How to use ReadAllText when file encoding unknown
This is likely due to a mismatch in the Encoding. Use the ReadAllText overload which allows you to specify the proper Encoding to use when reading the file.
The default overload will assume UTF-8 unless it can detect UTF-32. Any other encoding will come through incorrectly.
You need to specify the encoding when you call File.ReadAllText, unless the file is actually in UTF-8, which it sounds like it's not. (Basically the one-parameter overload is equivalent to passing in UTF-8 as the second argument. It will also detect UTF-32 with an appropriate byte-order mark, I believe.)
The first thing is to work out which encoding it is in (e.g. ISO-8859-1 - but you need to check this) and then pass that as a second argument.
For example:
Encoding isoLatin1 = Encoding.GetEncoding(28591);
string text = File.ReadAllText(path, isoLatin1);
It's always important that you know what encoding binary data is using before you try to read it as text. That's true for files, network streams, anything.
The character you are reading is the Replacement Character:
used to replace an incoming character whose value is unknown or unrepresentable in Unicode
compare the use of U+001A as a control character to indicate the substitute function
http://www.fileformat.info/info/unicode/char/fffd/index.htm
You are getting this because the actual encoding of the file does not match the encoding your program expects.
By default ReadAllText expects UTF-8. It is encountering a byte sequence that does not represent a valid UTF-8 character, so it replaces it with the Replacement character.
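A small illustration of the mismatch, assuming the ® was written as the single byte 0xAE (as it is in Windows-1252 / ISO-8859-1):

using System;
using System.Text;

byte[] data = { 0x41, 0xAE, 0x42 };                            // 'A', 0xAE ('®' in Windows-1252), 'B'
string asUtf8 = Encoding.UTF8.GetString(data);                 // "A\uFFFDB" - a lone 0xAE is not valid UTF-8
string asWin1252 = Encoding.GetEncoding(1252).GetString(data); // "A®B"
Console.WriteLine(asUtf8);
Console.WriteLine(asWin1252);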
I am trying to pass a block of text to a system I do not own, which will pass the data to a system I do own.
Unfortunately, when the first system talks to the second system, it uses a TSV format. Thus, I wonder if there's a convenient way to take my block of text and encode it in an ASCII format without any kind of whitespace (mostly newlines and tabs, of course), and then later decode it.
When I'm doing the encoding, I'm working in C#. When I'm doing the decoding, I'm working in JavaScript.
I realize that I can write my own code to essentially "manually" perform the encoding and decoding by creating my own scheme, but I wonder if there already exists one for this purpose.
One option which would blow up the size of your data but be really simple to implement: UTF-8 encode all the text, base64-encode that:
byte[] utf8 = Encoding.UTF8.GetBytes(text);
string base64 = Convert.ToBase64String(utf8);
That won't contain any whitespace, and can be converted back. It'll be significantly larger than the original string, and unreadable... but it'll work.
You could try using HttpUtility.UrlEncode(string) or Uri.EscapeDataString(string), which would percent-encode any whitespace in the passed in text (as well as other special characters, which means the encoded text may be much larger than the original).
On the JavaScript side you could then use decodeURIComponent(string) to decode it back to the original text.
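A sketch of the C# side (the payload string is just an example); decodeURIComponent on the JavaScript side understands the same %XX escapes:

using System;

string payload = "line one\nline two\tvalue";
string encoded = Uri.EscapeDataString(payload);     // "line%20one%0Aline%20two%09value" - no literal whitespace left
string roundTrip = Uri.UnescapeDataString(encoded); // C# equivalent of decodeURIComponent, handy for testing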
I read How can I detect the encoding/codepage of a text file
It's not possible to detect the encoding. However, is it possible to detect whether the encoding is one of two allowed ones?
For example I allow user to use Unicode UTF-8 and iso-8859-2 for their csv files. Is it possible to detect whether it is former or latter?
For example I allow user to use Unicode UTF-8 and iso-8859-2 for their csv files. Is it possible to detect whether it is former or latter?
It's not possible with 100% accuracy because, for example, the bytes C3 B1 are an equally valid representation of "Ăą" in ISO-8859-2 as they are of "ñ" in UTF-8. In fact, because ISO-8859-2 assigns a character to all 256 possible bytes, every UTF-8 string is also a valid ISO-8859-2 string (representing different characters if non-ASCII).
However, the converse is not true. UTF-8 has strict rules about what sequences are valid. More than 99% of possible 8-octet sequences are not valid UTF-8. And your CSV files are probably much longer than that. Because of this, you can get good accuracy if you:
Perform a UTF-8 validity check. If it passes, assume the data is UTF-8.
Otherwise, assume it's ISO-8859-2 (see the sketch below).
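A minimal sketch of that check (the method name is mine, and it assumes the whole file fits in memory):

using System.IO;
using System.Text;

static string ReadCsvText(string path)
{
    byte[] raw = File.ReadAllBytes(path);

    // Strict UTF-8 decoder: throws on invalid byte sequences instead of substituting U+FFFD.
    var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                                      throwOnInvalidBytes: true);
    try
    {
        return strictUtf8.GetString(raw);           // decodes cleanly, so assume UTF-8
    }
    catch (DecoderFallbackException)
    {
        // Not valid UTF-8, so assume ISO-8859-2.
        // (On .NET Core / .NET 5+ this code page may require registering CodePagesEncodingProvider.)
        return Encoding.GetEncoding("iso-8859-2").GetString(raw);
    }
}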
However is it possible to detect whether encoding is one of two allowed?
UTF-32 (either byte order), UTF-8, and CESU-8 can be reliably detected by validation.
UTF-16 can be detected by presence of a BOM (but not by validation, since the only way for an even-length byte sequence to be invalid UTF-16 is to have unpaired surrogates).
If you have at least one "detectable" encoding, then you can check for the detectable encoding, and use the undetectable encoding as a fallback.
If both encodings are "undetectable", like ISO-8859-1 and ISO-8859-2, then it's more difficult. You could try a statistical approach like chardet uses.
Since it is impossible to detect the encoding, you still cannot detect it even when you limit it down to two possible encodings.
The only thing that I can think of is that you could try decoding it with one of the two possible encodings, but then you would have to check whether it came out right. This would involve parsing the text, and even then you would not be 100% certain that it was right.
Both of those encodings share the same meaning for all octets <128.
So you would need to look at octets >= 128 to make the determination. Since in UTF-8 octets >= 128 always occur in groups (two-octet or longer sequences encoding a single code point), a three-octet sequence {<128, >=128, <128} would be an indication of ISO-8859-2.
If the file contains no octets outside ASCII (i.e. >= 128), or only very few, then determining the encoding will be impossible or unreliable. Of course, if the file starts with a UTF-8 encoded BOM (quite likely if it comes from Windows) then you know it is UTF-8.
It is generally more reliable to use some metadata (as XML does with its declaration) than to rely on a heuristic, because it is always possible someone has sent you ISO-8859-3.
If you use a StreamReader there is an overload which will detect the encoding if possible (BOM) but defaults to UTF8 if detection fails.
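For example, a sketch of that behaviour (the path is a placeholder):

using System.IO;

// Uses the encoding indicated by a BOM if one is present; otherwise falls back to UTF-8.
using (var reader = new StreamReader("users.csv", detectEncodingFromByteOrderMarks: true))
{
    string text = reader.ReadToEnd();
}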
I would suggest you use two options (UTF8 or Current) and if the user selects Current you use
var encoding = Encoding.GetEncoding(
    CultureInfo.CurrentCulture.TextInfo.OEMCodePage);
var reader = new StreamReader(path, encoding); // path is the file being read
which will hopefully be the right encoding.
See my (recent) answer to the linked question: How can I detect the encoding/codepage of a text file
This class will check whether it is possible that the file is UTF-8, and then it will attempt to guess whether it is probable.
I've got lots of text that I need to output, which includes all sorts of characters from many languages. Sometimes I need to output the text in character encodings other than Unicode (eg, Shift-JIS, or ISO-8859-2), in order to match the page it's going to.
If the text has characters that the encoding can't handle (eg, Japanese characters in ISO-8859-2 encoded output) I end up with odd characters in the output. I can escape them, but I'd rather do that only if it's really necessary.
So, my question is this: Is there a way I can tell ahead of time if an encoding can handle all the characters in my string?
EDIT:
I think the EncoderFallback is probably the right answer to the question I asked. Unfortunately it doesn't seem to work in my particular situation. My thought was to convert the characters to their HTML entity equivalents (eg, &#12514; instead of モ). However, the encoder only converts the first such character it finds, and if I set the Response.ContentEncoding it never calls my EncoderFallback at all.
You can write your own EncoderFallback class and assign it to the encoder before encoding.
Using this approach you need do nothing in advance (which would otherwise likely mean processing the string up front looking for problems).
Instead your Fallback class need only handle replacements where the encoding does not have a value for a character.
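For instance, a sketch of such a fallback that emits HTML numeric entities for unencodable characters (the class names are my own):

using System;
using System.Text;

class HtmlEntityFallback : EncoderFallback
{
    // Longest replacement we can produce, e.g. "&#1114111;"
    public override int MaxCharCount { get { return 10; } }

    public override EncoderFallbackBuffer CreateFallbackBuffer()
    {
        return new HtmlEntityFallbackBuffer();
    }
}

class HtmlEntityFallbackBuffer : EncoderFallbackBuffer
{
    private string entity = "";
    private int pos;

    public override bool Fallback(char charUnknown, int index)
    {
        entity = "&#" + (int)charUnknown + ";";
        pos = 0;
        return true;
    }

    public override bool Fallback(char highSurrogate, char lowSurrogate, int index)
    {
        entity = "&#" + char.ConvertToUtf32(highSurrogate, lowSurrogate) + ";";
        pos = 0;
        return true;
    }

    public override char GetNextChar()
    {
        return pos < entity.Length ? entity[pos++] : '\0';
    }

    public override bool MovePrevious()
    {
        if (pos == 0) return false;
        pos--;
        return true;
    }

    public override int Remaining { get { return entity.Length - pos; } }
}

Usage (ISO-8859-2 here is just an example target encoding):

Encoding enc = Encoding.GetEncoding("iso-8859-2",
    new HtmlEntityFallback(), DecoderFallback.ReplacementFallback);
byte[] bytes = enc.GetBytes("Text with モ in it"); // モ is not in ISO-8859-2, so it is written out as &#12514;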
Try to encode the string with an Encoding whose EncoderFallback is set to EncoderExceptionFallback. eg.:
Encoding e = Encoding.GetEncoding(932, new EncoderExceptionFallback(), new DecoderExceptionFallback());
Then catch EncoderFallbackException when you call GetBytes().
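For example (continuing with the Encoding e above; text is assumed to be the string you want to check):

try
{
    byte[] bytes = e.GetBytes(text);   // every character in text is representable in code page 932
}
catch (EncoderFallbackException ex)
{
    // ex.CharUnknown (or ex.CharUnknownHigh / ex.CharUnknownLow for a surrogate pair)
    // is the first character the encoding cannot represent.
}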
I think the methods already suggested should work. (The EncoderFallback solution seems quite nice.) Here's an alternative, however, in case you prefer it.
Create an encoder for the encoding you want to test by calling encoding.GetEncoder().
You can then call the Convert method of the Encoder object, passing in your text, and look at the value of the completed out parameter to determine whether it succeeded or not.
If speed is an issue, you may want to benchmark the various methods, but I suspect they would all have quite similar performance profiles.
Convert it to the target encoding, convert it back and compare it with the original?
Try Encoding.GetBytes() and Encoding.GetString() to convert back and forth.
As an optimization you could collect the set of Unicode characters used in your original string and just test those against the encoding.
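A minimal sketch of that round-trip check (the encoding and the variable text are assumptions):

using System.Text;

Encoding target = Encoding.GetEncoding("iso-8859-2");
byte[] bytes = target.GetBytes(text);             // with the default fallback, unencodable characters are replaced
string roundTripped = target.GetString(bytes);
bool fullyRepresentable = (roundTripped == text); // false if anything was lost or best-fit mapped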