C# Unrecognized characters while reading from binary file

I have some items whose information is split into two parts: one is the contents of a binary file, and the other is a textual entry inside a .txt file. I am trying to make an app that will pack this info into one text file (a text file because I have reasons to want this file to be human-readable as well), with the ability to later unpack that file by recreating the binary file and the text entry.
The first problem I ran into so far: some info is lost when converting the binary data into a string (or perhaps sooner, while reading the bytes), and I'm not sure whether the file is in a weird format or I'm doing something wrong. Some characters get shown as question marks.
Example of characters which are replaced with question marks:
ýÿÿ
This is the part where the info is read from the binary file and encoded into a string (which is how I intended to store it inside a text file).
byte[] binaryFile = File.ReadAllBytes(pathBinary);
// I also tried this for some reason: byte[] binaryFile = Encoding.ASCII.GetBytes(File.ReadAllText(pathBinary));
// this is the coded string that goes into the joined file to hold the binary file's information;
// when decoded, the result shows question marks instead of some characters
string binaryFileText = Convert.ToBase64String(binaryFile);
// this also shows question marks:
MessageBox.Show("binary file text: " + Encoding.ASCII.GetString(binaryFile), "debug", MessageBoxButtons.OK, MessageBoxIcon.Information);
I expect a few more caveats along the way with the second part of the app (unpacking back into text and binary), but so far my main problem is the unrecognized characters when reading the binary file or converting it to a string, which makes the data unusable for reproducing the file. Any help would be appreciated.

There is no universal conversion of binary data to a string. A string is a series of Unicode characters and as such can hold any character in the Unicode range.
Binary data is a series of bytes and as such can represent anything from video to a string in one of various formats.
Since there are multiple binary representations of strings, you need an Encoding to convert one into the other. The encoding you choose has to match the binary format of the string; if it doesn't, you will get the wrong result.
You are using ASCII encoding for the conversion, which is incorrect here. ASCII cannot encode the full Unicode range, so even if you use it for encoding, the result of decoding will not always match the original text.
If you have both encoding and decoding under control, use an Encoding that can do the full round trip, such as UTF8 or Unicode (UTF-16). If you didn't encode the string yourself, find out which Encoding was used and use that one.
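For the pack/unpack scenario in the question, Base64 side-steps the problem entirely, because the encoded string is pure ASCII and decodes back to the exact original bytes. A minimal sketch (pathUnpacked is a hypothetical output path, not from the question):
byte[] original = File.ReadAllBytes(pathBinary);     // raw bytes, any content
string packed = Convert.ToBase64String(original);    // ASCII-only text, safe inside a .txt file
byte[] restored = Convert.FromBase64String(packed);  // byte-for-byte identical to the original
byte[] restored = Convert.FromBase64String(packed);
File.WriteAllBytes(pathUnpacked, restored);          // pathUnpacked: hypothetical output path
The question marks only appear when you try to display the raw bytes as text (e.g. via Encoding.ASCII.GetString); the Base64 string itself is lossless.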

Related

How do I extract UTF-8 strings out of a JSON file using LitJSON, as JsonData does not seem to convert?

I've tried many methods to extract some strings out of a JSON file using LitJson in Unity.
I've tried encoding conversions all over, tried getting byte arrays and sending them around, and nothing seems to work.
I went to the very start of where I create the JsonData object and tried to run the following test:
public JsonData CreateJSONDataObject()
{
    Debug.Assert(pathName != null, "No JSON Data path name set. Please set before commencing read.");
    string jsonString = File.ReadAllText(Application.dataPath + pathName, System.Text.Encoding.UTF8);
    JsonData jsonDataObject = JsonMapper.ToObject(jsonString);
    Debug.Log("Test compatibility: ë | " + jsonDataObject["Roots"][2]["name"]);
    return jsonDataObject;
}
I made sure my jsonString is using UTF-8; however, the output shows this:
Test compatibility: ë | W�den
I've tried many other methods, but since this is where I make sure the encoding is right when creating the JsonData object, I can't think of what I am doing wrong, as I just don't know enough about JSON.
Thank you in advance.
This type of problem occurs when a text file is written with one encoding and read using a different one. I was able to reproduce your problem with the following program, which removes the JSON serialization from the equation entirely:
string file = @"c:\temp\test.txt";
string text = "Wöden";
File.WriteAllText(file, text, Encoding.Default);
string text2 = File.ReadAllText(file, Encoding.UTF8);
Debug.WriteLine(text2);
Since you are reading with UTF-8 and it is not working, the real question is, what encoding was used to write the file originally? You should be using the same encoding to read it back. I suspect that the file was originally created using either Windows-1252 or iso-8859-1 instead of UTF-8. Try using one of those when you read the file, e.g.:
string jsonString = File.ReadAllText(Application.dataPath + pathName,
    Encoding.GetEncoding("Windows-1252"));
You said in the comments that your JSON file was not created programmatically, but was "written by hand", meaning you used Notepad or some other text editor to make the file. If that is so, then that explains how you got into this situation. When you save the file, you should have the option to choose an encoding. For Notepad at least, the default encoding is "ANSI", which most likely maps to Windows-1252 (Western European), but depends on your locale. If you are in the Baltic region, for example, it would be Windows-1257 (Baltic). In any case, "ANSI" is not UTF-8. If you want to save the file in UTF-8 encoding, you have to specifically choose that option. Whatever option you use to save the file, that is the encoding you need to use to read it the next time, whether it is with a text editor or with code. Using the wrong encoding to read the file is what causes the corruption.
To change the encoding of a file, you first have to read it in using the same encoding that it was saved in originally, and then you can write it back out using a different encoding. You can do that with your text editor, simply by re-saving the file with a different encoding, or you can do that programmatically:
string text = File.ReadAllText(file, originalEncoding);
File.WriteAllText(file, text, newEncoding);
The key is knowing which encoding was used originally, and therein lies the rub. For legacy encodings (such as Windows-12xx) there is no way to tell because there is no marker in the file which identifies it. Unicode encodings (e.g. UTF-8, UTF-16), on the other hand, do write out a marker at the beginning of the file, called a BOM, or byte-order mark, which can be detected programmatically. That, coupled with the fact that Unicode encodings can represent all characters, is why they are much preferred over legacy encodings.
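If you need to handle this in code, here is a minimal sketch (the Windows-1252 fallback is an assumption for a legacy "ANSI" file, not something the framework mandates):
using (var reader = new StreamReader(file, Encoding.GetEncoding("Windows-1252"), detectEncodingFromByteOrderMarks: true))
{
    string text = reader.ReadToEnd();
    // after the first read, reader.CurrentEncoding reports what was actually used:
    // a Unicode encoding if a BOM was found, otherwise the Windows-1252 fallback
}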
For more information, I highly recommend reading What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.

Is there cross-platform method to encode a string into another string without any whitespaces and then decode it back?

I am trying to pass a block of text to a system I do not own, which will pass the data to a system I do own.
Unfortunately, when the first system talks to the second system, it uses a TSV format. Thus, I wonder if there's a convenient way to take my block of text and encode it in an ASCII format without any kind of whitespace (mostly newlines and tabs, of course), and then later decode it.
When I'm doing the encoding, I'm working in C#. When I'm doing the decoding, I'm working in Javascript.
I realize that I can write my own code to essentially "manually" perform the encoding and decoding by creating my own scheme, but I wonder if there already exists one for this purpose.
One option which would blow up the size of your data but be really simple to implement: UTF-8-encode all the text, then base64-encode the result:
byte[] utf8 = Encoding.UTF8.GetBytes(text);
string base64 = Convert.ToBase64String(utf8);
That won't contain any whitespace, and can be converted back. It'll be significantly larger than the original string, and unreadable... but it'll work.
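Decoding is just as short; on the C# side it would be (on the JavaScript side, the equivalent pairs atob with a UTF-8 decoding step, e.g. via TextDecoder):
byte[] bytes = Convert.FromBase64String(base64);
string original = Encoding.UTF8.GetString(bytes);  // the exact original text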
You could try using HttpUtility.UrlEncode(string) or Uri.EscapeDataString(string), which would percent-encode any whitespace in the passed in text (as well as other special characters, which means the encoded text may be much larger than the original).
On the javascript side you could then use decodeURIComponent(string) to decode it back to the original text.
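A hedged sketch of that pairing, with only the C# side shown here:
string encoded = Uri.EscapeDataString(text);  // space -> %20, tab -> %09, newline -> %0A
// JavaScript side: decodeURIComponent(encoded) restores the original text,
// since both functions use UTF-8 percent-encoding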

"Unable to translate Unicode character" error when saving to txt file

Additional information: Unable to translate Unicode character \uDFFF at index 195 to specified code page.
I made an algorithm whose results are binary values (of different lengths). I transformed them into uint, then into chars, and saved them into a StringBuilder, as you can see below:
uint n = Convert.ToUInt16(tmp_chars, 2);
_koded_text.Append(Convert.ToChar(n));
My problem is that when I try to save those values into a .txt file, I get the previously mentioned error.
StreamWriter file = new StreamWriter(filename);
file.WriteLine(_koded_text);
file.Close();
What I am saving is this: "忿췾᷿]볯褟ﶞ痢ﳻ��伞ﳴ㿯ﹽ翼蛿㐻ﰻ筹��﷿₩マ랿鳿⏟麞펿"... which are some weird signs.
What I need is to convert those binary values into some kind of string of chars and save it to a .txt file. I saw somewhere that converting to UTF-8 should help, but I don't know how. Would changing the file's encoding help too?
You cannot transform binary data to a string directly. The Unicode characters in a string are encoded using UTF-16 in .NET. That encoding uses two bytes per character, providing 65536 distinct values. Unicode however has over one million codepoints. To make that work, the Unicode codepoints above \uffff (above the BMP, the Basic Multilingual Plane) are encoded with a surrogate pair: the first unit has a value between 0xd800 and 0xdbff, the second between 0xdc00 and 0xdfff. That provides 2^(10+10) = about one million additional codes.
You can perhaps see where this leads: in your case the encoder hits a surrogate value (0xdfff) that isn't part of a valid surrogate pair. That's illegal. There are lots more possible mishaps: several codepoints are unassigned, and several are diacritics that get mangled when the string is normalized.
You just can't make this work. Base64 encoding is the standard way to carry binary data across a text stream. It uses 6 bits per character, so 3 bytes require 4 characters. The character set is ASCII, so the odds of the receiving program decoding a character back to binary incorrectly are minimal. Only a decades-old IBM mainframe that uses EBCDIC could get you into trouble. Or just plain avoid encoding to text and keep it binary.
Since you're trying to encode binary data into a text stream, this SO question already contains the answer: "How do I encode something as base64?" From there, plain ASCII/ANSI text is fine as the output encoding.
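A hedged sketch of that approach, assuming each tmp_chars chunk from the algorithm is a bit string of at most 8 bits (bitChunks and _koded_bytes are illustrative names, not from the question):
var _koded_bytes = new List<byte>();
foreach (string tmp_chars in bitChunks)              // bitChunks: the algorithm's bit-string outputs
    _koded_bytes.Add(Convert.ToByte(tmp_chars, 2));  // parse base-2, keep as bytes instead of chars
File.WriteAllText(filename, Convert.ToBase64String(_koded_bytes.ToArray()));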

C# How can I remove newline characters from binary?

Basically I have binary data. I don't mind if it's unreadable, but I'm writing it to a file which is parsed, so it's important that newline characters are taken out.
I thought I had done the right thing when I converted it to a string...
byte[] b = (byte[])SubKey.GetValue(v[i]);
s = System.Text.ASCIIEncoding.ASCII.GetString(b);
and then removed the newlines
String t = s.Replace("\n", "");
but it's not working?
Newline might be \r\n, and your binary data might not be ASCII-encoded.
Firstly, a newline (Environment.NewLine) is usually two characters on Windows; do you mean removing single carriage-return or line-feed characters?
Secondly, applying a text encoding to binary data is likely to lead to unexpected conversions. E.g. what will happen to bytes of the binary data that do not map to ASCII characters?
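A small sketch of that second point (any byte above 0x7F has no ASCII mapping):
byte[] data = { 0x48, 0x69, 0xFF };        // 0xFF is not a valid ASCII code unit
string s = Encoding.ASCII.GetString(data); // "Hi?" because 0xFF is replaced with '?'
byte[] back = Encoding.ASCII.GetBytes(s);  // { 0x48, 0x69, 0x3F }: the original byte is gone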
A newline character may be \n, \r, or \r\n depending on the operating system; those are the markers for Linux, Macintosh, and Windows respectively.
But if you say your file is binary, how do you know it contains ASCII newlines in its content?
If it is a binary file, it may hold some struct; if so, removing newline characters will shift left all the data that follows each removed byte and corrupt it.
I would imagine removing the bytes in a binary chunk which correspond to line feeds would actually corrupt the binary data, thereby making it useless.
Perhaps you'd be better off using Base64 encoding, which will produce ASCII-safe output.
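A hedged sketch using the variables from the question:
byte[] b = (byte[])SubKey.GetValue(v[i]);
string safe = Convert.ToBase64String(b);   // the plain overload emits no line breaks at all
// later: byte[] original = Convert.FromBase64String(safe);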
If this is text data, then load it as text (using the correct encoding), do the replacement as a string operation, and re-encode it (using the correct encoding). For some encodings you might be able to do the swap at the file level (without decoding/encoding), but I wouldn't bet on it.
If this is any other binary representation, you will have to know the exact details. For example, it is common (but not for certain) for strings embedded in part of a binary file to have a length prefix. If you change the data without changing the length prefix, you've just corrupted the file. And to change the length prefix you need to know the format (it might be big-endian/little-endian, any fixed number of bytes, or the prefix itself could be variable length). Or it might be delimited. Or there might be relative offsets scattered through the file that all need fixing.
Just as likely, you could by chance have the same byte sequence in the binary that doesn't represent a newline; you could be completely trashing the data.

How to get correctly-encoded HTML from the clipboard?

Has anyone noticed that if you retrieve HTML from the clipboard, it gets the encoding wrong and injects weird characters?
For example, executing a command like this:
string s = (string)Clipboard.GetData(DataFormats.Html);
Results in stuff like:
<FONT size=-2>  <A href="/advanced_search?hl=en">Advanced
Search</A><BR>  Preferences<BR>  <A
href="/language_tools?hl=en">Language
Tools</A></FONT>
Not sure how Markdown will process this, but there are weird characters in the resulting markup above.
It appears that the bug is with the .NET framework. What do you think is the best way to get correctly-encoded HTML from the clipboard?
In this case it is not as visible as it was in mine. Today I tried to copy data from the clipboard that contained a few Unicode characters, and what I got looked as if I had read a UTF-8 encoded file using the Windows-1250 encoding (the local encoding on my Windows).
It seems your case is the same. If you save the HTML data in Windows-1252 (or Windows-1250; both work), remembering to put a non-breaking space (0xa0) after the Â character rather than a standard space, and then open that file as UTF-8, you will see what should actually be there.
For another project of mine I wrote a function that fixes data with corrupted encoding.
In this case a simple conversion should be sufficient:
byte[] data = Encoding.Default.GetBytes(text);
text = Encoding.UTF8.GetString(data);
My original function is a little more complex and contains tests to ensure that the data are not corrupted:
public static bool FixMisencodedUTF8(ref string text, Encoding encoding)
{
    if (string.IsNullOrEmpty(text))
        return false;

    byte[] data = encoding.GetBytes(text);

    // there should not be any character outside the source encoding
    string newStr = encoding.GetString(data);
    if (!string.Equals(text, newStr)) // if there is any character "outside"
        return false;                 // leave, the input is in a different encoding

    if (IsValidUtf8(data) == 0)       // test data to be a valid UTF-8 byte sequence
        return false;                 // if not, cannot convert to UTF-8

    text = Encoding.UTF8.GetString(data);
    return true;
}
I know that this is not the best (or a correct) solution, but I did not find any other way to fix the input...
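For illustration, a hedged usage sketch (it assumes the text was mis-decoded through the local ANSI code page, i.e. Encoding.Default):
string s = (string)Clipboard.GetData(DataFormats.Html);
if (FixMisencodedUTF8(ref s, Encoding.Default))
{
    // s now holds the correctly decoded text
}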
EDIT (July 20, 2017):
It seems that Microsoft has already found this error, because it now works correctly. I'm not sure whether the problem was in a particular framework version, but I know for sure that the application now uses a different framework than when I wrote this answer. (Now it is 4.5; the previous version was 2.0.)
(Now all my code fails when parsing the data. There is another problem: determining the correct behaviour for an application both with the fix already applied and without it.)
You have to interpret the data as UTF-8. See MS Office hyperlinks change code page?.
The DataFormats.Html specification states it's encoded in UTF-8. But there's a bug in .NET Framework 4 and lower: it actually reads the UTF-8 data as Windows-1252.
You get a lot of wrong encodings, leading to funny/bad characters such as:
'Å','‹','Å’','Ž','Å¡','Å“','ž','Ÿ','Â','¡','¢','£','¤','Â¥','¦','§','¨','©'
Full explanation here: Debugging Chart Mapping Windows-1252 Characters to UTF-8 Bytes to Latin-1 Characters.
Solution: create a translation dictionary, then search and replace.
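A simpler alternative to a lookup table, mirroring the re-decode snippet earlier in this thread (hedged; it assumes the text really was mis-decoded as Windows-1252):
string html = (string)Clipboard.GetData(DataFormats.Html);
byte[] raw = Encoding.GetEncoding(1252).GetBytes(html); // recover the original UTF-8 bytes
html = Encoding.UTF8.GetString(raw);                    // decode them correctly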
I don't know what your original source document is, but be aware that Word and Outlook provide several versions of the clipboard in different encodings. One is usually Windows-1252 and another is UTF-8. Possibly you're grabbing the UTF-8 encoded version by default, when you're expecting the Windows-1252 (Latin-1 + Smart Quotes)? Non-ASCII characters would show up as multiple odd Latin-1 accented characters. Most "Smart Quotes" are not in the Latin-1 set and are often three bytes in UTF-8.
Can you specify which encoding you want the clipboard contents in?
Try this:
System.Windows.Forms.Clipboard.GetText(System.Windows.Forms.TextDataFormat.Html);
