Is it possible to detect the encoding of a text file when it is one of two possibilities? - c#

I read How can I detect the encoding/codepage of a text file
It's not possible to detect the encoding in general. However, is it possible to detect whether the encoding is one of two allowed ones?
For example, I allow users to use Unicode UTF-8 and ISO-8859-2 for their CSV files. Is it possible to detect whether a file is the former or the latter?

For example, I allow users to use Unicode UTF-8 and ISO-8859-2 for their CSV files. Is it possible to detect whether a file is the former or the latter?
It's not possible with 100% accuracy because, for example, the bytes C3 B1 are an equally valid representation of "Ăą" in ISO-8859-2 as they are of "ñ" in UTF-8. In fact, because ISO-8859-2 assigns a character to all 256 possible bytes, every UTF-8 string is also a valid ISO-8859-2 string (representing different characters if non-ASCII).
However, the converse is not true. UTF-8 has strict rules about what sequences are valid. More than 99% of possible 8-octet sequences are not valid UTF-8. And your CSV files are probably much longer than that. Because of this, you can get good accuracy if you:
Perform a UTF-8 validity check (a sketch follows below). If it passes, assume the data is UTF-8.
Otherwise, assume it's ISO-8859-2.
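A minimal sketch of that validity check, assuming .NET and a file small enough to read fully into memory (the helper name LooksLikeUtf8 is illustrative, not from the original answer):
using System.IO;
using System.Text;

// Returns true if the bytes form valid UTF-8.
static bool LooksLikeUtf8(byte[] data)
{
    // A UTF8Encoding configured to throw on invalid byte sequences.
    var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
    try
    {
        strictUtf8.GetString(data); // throws DecoderFallbackException on any invalid sequence
        return true;
    }
    catch (DecoderFallbackException)
    {
        return false;
    }
}

// Usage: pick the encoding based on the check.
// byte[] data = File.ReadAllBytes(path);
// Encoding encoding = LooksLikeUtf8(data) ? Encoding.UTF8 : Encoding.GetEncoding("iso-8859-2");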
However, is it possible to detect whether the encoding is one of two allowed ones?
UTF-32 (either byte order), UTF-8, and CESU-8 can be reliably detected by validation.
UTF-16 can be detected by presence of a BOM (but not by validation, since the only way for an even-length byte sequence to be invalid UTF-16 is to have unpaired surrogates).
If you have at least one "detectable" encoding, then you can check for the detectable encoding, and use the undetectable encoding as a fallback.
If both encodings are "undetectable", like ISO-8859-1 and ISO-8859-2, then it's more difficult. You could try a statistical approach like chardet uses.

Since it is impossible to detect the encoding in general, you still cannot detect it with certainty even when you limit it to two possible encodings.
The only thing I can think of is that you could try decoding the file with each of the two possible encodings and then check whether the result came out right. That would involve parsing the text, and even then you would not be 100% certain it was right.

Both of those encodings share the same meaning for all octets <128.
So you would need to look at octets >= 128 to make the determination. Since in UTF-8 octets >= 128 always occur in groups (two-octet or longer sequences encoding a single code point), a three-octet sequence {<128, >=128, <128} is an indication of ISO-8859-2.
If the file contains no octets outside ASCII (i.e. <128), or very few, then determining the encoding will be impossible or unreliable. Of course, if the file starts with a UTF-8 encoded BOM (quite likely if it comes from Windows) then you know it is UTF-8.
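A minimal sketch of that BOM check (the helper name is illustrative; it assumes the file can be opened for reading):
using System.IO;

// True if the file starts with the UTF-8 BOM bytes EF BB BF.
static bool HasUtf8Bom(string path)
{
    using (var stream = File.OpenRead(path))
    {
        var bom = new byte[3];
        int read = stream.Read(bom, 0, 3);
        return read == 3 && bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF;
    }
}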
It is generally more reliable to use some metadata (as XML does with its declaration) than to rely on a heuristic, because it is always possible someone has sent you ISO-8859-3.

If you use a StreamReader there is an overload which will detect the encoding if possible (from a BOM) but defaults to UTF-8 if detection fails.
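For example (a sketch; path is assumed to be the file you are reading):
// requires using System.IO; and using System.Text;
using var reader = new StreamReader(path, Encoding.UTF8, detectEncodingFromByteOrderMarks: true);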
I would suggest you offer two options (UTF-8 or Current), and if the user selects Current you use
// using System.Globalization; using System.IO; using System.Text;
var encoding = Encoding.GetEncoding(
    CultureInfo.CurrentCulture.TextInfo.OEMCodePage);
var reader = new StreamReader(path, encoding); // path: the file being read
which will hopefully be the right encoding.

See my (recent) answer to the linked question: How can I detect the encoding/codepage of a text file
This class will check whether it is possible that the file is UTF-8, and then it will attempt to guess whether it is probable.

Related

How do I use C#'s IndexOf when strange characters are in the string

Below is what the text looks like when viewed in NotePad++.
I need to get the IndexOf for that piece of the string, to use in the code below. And I can't figure out how to use the odd characters in my code.
int start = text.IndexOf("AppxxxxxxDB INFO");
Where the "xxxxx"'s represent the strange characters.
All these characters have corresponding ASCII codes; you can insert them in a string by escaping them.
For instance:
"App\x0000\x0001\x0000\x0003\x0000\x0000\x0000DB INFO"
or shorter:
"App\x00\x01\x00\x03\x00\x00\x00"+"DB INFO"
\xXXXX means you specify one character, with XXXX being the hexadecimal number of that character. Note that \x accepts one to four hex digits, which is why the shorter form concatenates "DB INFO" as a separate literal: otherwise the D and B would be swallowed as part of the escape.
Notepad++ simply wants to make it a bit more convenient by rendering these characters by printing the abbreviation in a "bubble". But that's just rendering.
The origin of these characters is printer (and other media) directives. For instance, you needed to instruct a printer to move to the next line or to stop the print job; they are still used today. Some terminals use them to communicate color changes, etc. The most well known is \n, or \x000A, which starts a new line. For text they are thus characters that specify how to handle the text, a bit like modern HTML (although the equivalence is limited). \n is only a new line because there is a consensus about that; if someone defines their own encoding, they can invent a new system.
Echoing @JonSkeet's warning: when you read a file into a string, the file's bytes are decoded according to a character set encoding. The decoder has to do something with byte values or sequences that are invalid per the encoding rules. Typical decoders substitute a replacement character and attempt to go on.
I call that data corruption. In most cases, I'd rather have the decoder throw an exception.
You can use a standard decoder, customize one or create a new one with the Encoding class to get the behavior you want. Or, you can preserve the original bytes by reading the file as bytes instead of as text.
If you insist on reading the file as text, I suggest using the 437 encoding because it has 256 characters, one for every byte value, no restrictions on byte sequences and each 437 character is also in Unicode. The bytes that represent text will possibly decode the same characters that you want to search for as strings, but you have to check, comparing 437 and Unicode in this table.
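A sketch of that approach (the file path is an assumption; code page 437 is the IBM PC/OEM encoding mentioned above):
using System.IO;
using System.Text;

// On .NET Core / .NET 5+, code page 437 needs the System.Text.Encoding.CodePages package:
// Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
Encoding cp437 = Encoding.GetEncoding(437);   // every byte value maps to a character
string text = File.ReadAllText(path, cp437);
int start = text.IndexOf("DB INFO");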
Really, you should have and follow the specification for the file type you are reading. After all, there is no text but encoded text, and you have to know which encoding it is.

Default C# String encoding

I am having some issues with the default string encoding in C#. I need to read strings from certain files/packets. However, these strings include characters from the 128-255 range (extended ASCII), and all of these characters show up as question marks instead of the proper character. For example, when reading a string, it could come up as "S?meStr?n?" if the string contained extended ASCII characters.
Now, is there any way to change the default encoding for my application? I know in Java you could define the default character set from the command line.
There's no one single "extended ASCII" encoding. There are lots of different 8-bit encodings which are compatible with ASCII for the bottom 128 values.
You need to find out what encoding your files actually use, and specify that when reading the data with StreamReader (or whatever else you're using). For example, you may want encoding Windows-1252:
Encoding encoding = Encoding.GetEncoding(1252);
.NET strings are always sequences of UTF-16 code points. You can't change that, and you shouldn't try. (That's true in Java as well, and you really shouldn't use the platform default encoding when calling getBytes() etc unless that's what you really, really mean.)
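Plugging that encoding into a reader might look like this (a sketch; the file path is assumed):
// using System.IO; using System.Text;
using (var reader = new StreamReader(path, Encoding.GetEncoding(1252)))
{
    string line = reader.ReadLine();
    // characters 128-255 now decode as the intended Windows-1252 characters
}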
An Encoding can be specified in at least one overload of functions for reading text - for example, ReadAllText(string, Encoding).
So if you know a file is encoded using Windows-1252, you can specify it like so:
string contents = File.ReadAllText(someFilePath, Encoding.GetEncoding(1252));
Of course, doing this requires knowing ahead of time which code page is being used.

"Unable to translate Unicode character" error when saving to txt file

Additional information: Unable to translate Unicode character \uDFFF at index 195 to specified code page.
I made an algorithm whose results are binary values (of different lengths). I transformed them into uint, then into chars, and saved them into a StringBuilder, as you can see below:
uint n = Convert.ToUInt16(tmp_chars, 2);
_koded_text.Append(Convert.ToChar(n));
My problem is that when I try to save those values into a .txt file I get the previously mentioned error.
StreamWriter file = new StreamWriter(filename);
file.WriteLine(_koded_text);
file.Close();
What I am saving is this: "忿췾᷿]볯褟ﶞ痢ﳻ��伞ﳴ㿯ﹽ翼蛿㐻ﰻ筹��﷿₩マ랿鳿⏟麞펿"... which are some weird signs.
What I need is to convert those binary values into some kind of string of chars and save it to a txt file. I saw somewhere that converting to UTF-8 should help, but I don't know how. Would changing the file's encoding help too?
You cannot transform binary data to a string directly. The Unicode characters in a string are encoded using UTF-16 in .NET. That encoding uses two bytes per character, providing 65536 distinct values. Unicode however has over one million codepoints. To make that work, the Unicode codepoints above \uffff (above the BMP, the Basic Multilingual Plane) are encoded with a surrogate pair. The first one has a value between 0xd800 and 0xdbff, the second between 0xdc00 and 0xdfff. That provides 2^(10+10) = roughly one million additional codes.
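For reference, a small illustration (not part of the original answer) of how a surrogate pair maps back to a code point:
// The high surrogate (D800-DBFF) carries 10 bits, the low surrogate (DC00-DFFF) the other 10.
char high = '\uD83D', low = '\uDE00';                             // the pair encoding U+1F600
int codePoint = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
// codePoint == 0x1F600; equivalently: char.ConvertToUtf32(high, low)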
You can perhaps see where this leads: in your case the code produces a surrogate value (0xdfff, from the trailing-surrogate range) that isn't part of a valid surrogate pair. That's illegal. There are lots more possible mishaps: several codepoints are unassigned, and several are diacritics that get mangled when the string is normalized.
You just can't make this work. Base64 encoding is the standard way to carry binary data across a text stream. It uses 6 bits per character, 3 bytes require 4 characters. The character set is ASCII so the odds of the receiving program decoding the character back to binary incorrectly are minimal. Only a decades old IBM mainframe that uses EBCDIC could get you into trouble. Or just plain avoid encoding to text and keep it binary.
Since you're trying to encode binary data into a text stream, this SO question already contains the answer: "How do I encode something as base64?" From there, plain ASCII/ANSI text is fine for the output encoding.
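A sketch of the Base64 route (the file name and sample bytes are illustrative):
using System;
using System.IO;

byte[] data = { 0x00, 0xFF, 0xDF, 0x3A };           // whatever your algorithm produced
string encoded = Convert.ToBase64String(data);       // plain ASCII, safe for any text file
File.WriteAllText("koded_text.txt", encoded);

// Reading it back:
byte[] roundTripped = Convert.FromBase64String(File.ReadAllText("koded_text.txt"));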

C# How can I remove newline characters from binary?

Basically I have binary data. I don't mind if it's unreadable, but I'm writing it to a file which is parsed, so it's important that newline characters are taken out.
I thought I had done the right thing when I converted it to a string....
byte[] b = (byte[])SubKey.GetValue(v[i]);
s = System.Text.ASCIIEncoding.ASCII.GetString(b);
and then removed the newlines
String t = s.Replace("\n", "");
but it's not working?
Newline might be \r\n, and your binary data might not be ASCII encoded.
Firstly, a newline (Environment.NewLine) is usually two characters on Windows; do you mean removing single carriage-return or line-feed characters?
Secondly, applying a text encoding to binary data is likely to lead to unexpected conversions. E.g. what will happen to bytes of the binary data that do not map to ASCII characters?
The newline marker may be \n, \r, or \r\n depending on the operating system: those are the conventions for Linux, (classic) Macintosh, and Windows respectively.
But if you say the file is binary, how do you know it contains ASCII newlines in its content at all?
If it is a binary file it may hold some structure; if so, removing newline bytes shifts all the data after each removed byte to the left and corrupts the file.
I would imagine removing the bytes in a binary chunk which correspond the line feeds would actually corrupt the binary data, thereby making it useless.
Perhaps you'd be better off using base64 encoding, which will produce ASCII-safe output.
If this is text data, then load it as text data (using the correct encoding), do the replace as a string, and re-encode it (using the correct encoding). For some encodings you might be able to do a swap at the file level (without decoding/encoding), but I wouldn't bet on it.
If this is any other binary representation, you will have to know the exact details. For example, it is common (but not for certain) for strings embedded in part of a binary file to have a length prefix. If you change the data without changing the length prefix, you've just corrupted the file. And to change the length prefix you need to know the format (it might be big-endian/little-endian, any fixed number of bytes, or the prefix itself could be variable length). Or it might be delimited. Or there might be relative offsets scattered through the file that all need fixing.
Just as likely, you could by chance have the same byte sequence in the binary that doesn't represent a newline; you could be completely trashing the data.
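If it really is text, the round trip described above might look like this (a sketch assuming Windows-1252; swap in whatever encoding and file names actually apply):
using System.IO;
using System.Text;

Encoding enc = Encoding.GetEncoding(1252);

// Decode with the real encoding, do the replacement as a string, re-encode.
string text = enc.GetString(File.ReadAllBytes("input.dat"));
string cleaned = text.Replace("\r\n", "").Replace("\n", "").Replace("\r", "");
File.WriteAllBytes("output.dat", enc.GetBytes(cleaned));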

.NET: How do I tell if an encoding supports all the chars in my string?

I've got lots of text that I need to output, which includes all sorts of characters from many languages. Sometimes I need to output the text in character encodings other than Unicode (eg, Shift-JIS, or ISO-8859-2), in order to match the page it's going to.
If the text has characters that the encoding can't handle (eg, Japanese characters in ISO-8859-2 encoded output) I end up with odd characters in the output. I can escape them, but I'd rather do that only if it's really necessary.
So, my question is this: Is there a way I can tell ahead of time if an encoding can handle all the characters in my string?
EDIT:
I think the EncoderFallback is probably the right answer to the question I asked. Unfortunately it doesn't seem to work in my particular situation. My thought was to convert the characters to their HTML entity equivalents (eg, &#12514; instead of モ). However, the encoder only converts the first such character it finds, and if I set the Response.ContentEncoding it never calls my EncoderFallback at all.
You can write your own EncoderFallback class and assign it to the encoder before encoding.
Using this approach you need do nothing in advance (which would likely amount to simply scanning the output string for problems).
Instead, your fallback class only needs to handle replacements where the encoding does not have a value for a character.
Try to encode the string with an Encoding whose EncoderFallback is set to EncoderExceptionFallback. eg.:
Encoding e= Encoding.GetEncoding(932, new EncoderExceptionFallback(), new DecoderExceptionFallback());
Then catch EncoderFallbackException when you GetBytes().
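A sketch of that check (932 is Shift-JIS as in the line above; the helper name is illustrative):
using System.Text;

static bool CanEncode(string text, int codePage)
{
    var enc = Encoding.GetEncoding(codePage,
        new EncoderExceptionFallback(), new DecoderExceptionFallback());
    try
    {
        enc.GetBytes(text);   // throws if any character has no mapping in this code page
        return true;
    }
    catch (EncoderFallbackException)
    {
        return false;
    }
}

// bool ok = CanEncode("日本語テキスト", 932);     // Shift-JIS can represent this
// bool bad = CanEncode("日本語テキスト", 28592);  // ISO-8859-2 cannot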
I think the methods already should work. (The EncoderFallback solution seems quite nice.) Here's an alternative however, in case you prefer it.
Create an encoder for the encoding you want to test by calling encoding.GetEncoder().
You can then call the Convert method of the Encoder object, passing in your text, and looking at the value of the completed out parameter to determine whether it succeeded or not.
If speed is an issue, you may want to benchmark the various methods, but I suspect they would all have quite similar performance profiles.
Convert it to the target encoding, convert it back and compare it with the original?
Try Encoding.GetBytes() and Encoding.GetString() to convert back and forth.
As an optimization, you could collect the distinct Unicode characters used in your original string and only run those through the encoding.
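A sketch of that round trip (it assumes the original text doesn't legitimately contain the encoding's replacement character, usually '?'):
using System.Text;

static bool SurvivesRoundTrip(string text, Encoding encoding)
{
    // Encode with the default (replacement) fallback, decode again, compare with the original.
    byte[] bytes = encoding.GetBytes(text);
    string roundTripped = encoding.GetString(bytes);
    return roundTripped == text;
}

// bool ok = SurvivesRoundTrip("Zażółć gęślą jaźń", Encoding.GetEncoding("iso-8859-2"));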
