I am trying to read some text file with the following line:
"WE BUY : 10 000.00 USD"
First I opened the file in a binary editor: the 13th character (12th with 0-based C# indexing), the thousands separator, is byte 160 decimal (A0 hex) in the Windows-1251 encoding.
However, after I read this line into a string using File.ReadAllLines, the debugger shows that the character now has code 65533:
"lines[9][12] 65533 '�' char"
The default encoding (Encoding.Default) on my PC is Windows-1251.
How come?
UPDATE
I tried opening the file with UTF-8 encoding; still the same result.
UPDATE 2
The problem is that the file encoding is 8-bit, but for the 8-bit character 0xA0 the debugger shows the 16-bit value 65533.
The one-argument overload of File.ReadAllLines assumes the input is UTF-8, whatever the system default encoding. For anything else you need to specify the encoding:
var lines = File.ReadAllLines(filename, Encoding.GetEncoding(name));
You can get the name from Encoding.Default.WebName ("windows-1252" is what I get here, but check locally).
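The behaviour can be reproduced on the single byte in question. A minimal sketch (note that on .NET Core / .NET 5+ the legacy code pages come from the System.Text.Encoding.CodePages package; on .NET Framework they are built in):

```csharp
using System;
using System.Text;

class Cp1251Demo
{
    static void Main()
    {
        // Needed on .NET Core / .NET 5+ so legacy code pages like
        // windows-1251 are available; harmless elsewhere.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        byte[] raw = { 0xA0 }; // the thousands-separator byte from the file

        // UTF-8 cannot decode a lone 0xA0, so the decoder substitutes U+FFFD.
        string asUtf8 = Encoding.UTF8.GetString(raw);
        Console.WriteLine((int)asUtf8[0]); // 65533

        // windows-1251 maps 0xA0 to U+00A0, a no-break space.
        string asCp1251 = Encoding.GetEncoding("windows-1251").GetString(raw);
        Console.WriteLine((int)asCp1251[0]); // 160
    }
}
```

This is exactly the 160 → 65533 change seen in the debugger: it happens during decoding, not while reading bytes from disk.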
Related
I've got a file that looks OK in Notepad (and Notepad++) but when I try to read it with a C# program, the dash shows up as a replacement character (�) instead. After some trial and error, I can reproduce the error as follows:
File.WriteAllBytes("C:\\Misc\\CharTest\\wtf.txt", new byte[] { 150 });
var readFile = File.ReadAllText("C:\\Misc\\CharTest\\wtf.txt");
Console.WriteLine(readFile);
Now, if you go and look in the wtf.txt file using Notepad, you'll see a dash... but I don't get it. I know that's not a "real" Unicode value so that's probably the root of the issue, but I don't get why it looks fine in Notepad and not when I read in the file. And how do I get the file to read it as a dash?
As an aside, a VB6 program I'm trying to rewrite in C# also reads it as a dash.
The File.ReadAllText(string) overload defaults to UTF8 encoding, in which a standalone byte with value 150 is invalid.
Specify the actual encoding of the file, for example:
var encoding = Encoding.GetEncoding(1252);
string content = File.ReadAllText(fileName, encoding);
I used the Windows-1252 encoding, which has an en dash (U+2013) at code point 150.
Edit: Notepad displays the file correctly because, for non-Unicode files, the Windows-1252 code page is the default with Western regional settings. So you can likely also use Encoding.Default to get the correct result, but keep in mind that Encoding.Default can return different code pages under different regional settings.
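A quick sketch confirming both halves of this answer on the byte 150 itself:

```csharp
using System;
using System.Text;

class DashDemo
{
    static void Main()
    {
        // .NET Core / .NET 5+ need this for code page 1252 (System.Text.Encoding.CodePages).
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        byte[] raw = { 150 }; // 0x96

        // windows-1252 decodes 0x96 as U+2013 (en dash).
        Console.WriteLine(Encoding.GetEncoding(1252).GetString(raw) == "\u2013"); // True

        // As UTF-8, a lone 0x96 is an invalid continuation byte,
        // so the default decoder substitutes U+FFFD instead.
        Console.WriteLine(Encoding.UTF8.GetString(raw) == "\uFFFD"); // True
    }
}
```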
You are writing raw bytes to a text file, and then you are reading those bytes back and interpreting them as chars.
Now, when you write bytes you don't have to care about encoding, but you do in order to read those very same bytes back as chars.
Notepad++ seems to interpret the byte according to an 8-bit (ANSI) code page and therefore prints the dash.
File.ReadAllText reads the bytes in the specified encoding; since you did not specify one, it falls back to its default, which is UTF-8, where 150 is not a valid entry.
I have some items whose information is split into two parts: one is the contents of a binary file, and the other is a textual entry inside a .txt file. I am trying to make an app that will pack this info into one text file (a text file because I have reasons to want this file to be human-readable as well), with the ability to later unpack that file back by creating a new binary file and text entry.
The first problem I ran into: some info is lost when converting binary into a string (or perhaps sooner, while reading the bytes), and I'm not sure if the file is in a weird format or I'm doing something wrong. Some characters get shown as question marks.
Example of characters which are replaced with question marks:
ýÿÿ
This is the part where the info is read from the binary file and gets encoded into a string (which is how I intended to store it inside a text file).
byte[] binaryFile = File.ReadAllBytes(pathBinary);
// I also tried this for some reason: byte[] binaryFile = Encoding.ASCII.GetBytes(File.ReadAllText(pathBinary));
string binaryFileText = Convert.ToBase64String(binaryFile); //this is the coded string that goes into joined file to hold binary file information, when decoded the result shows question marks instead of some characters
MessageBox.Show("binary file text: " + Encoding.ASCII.GetString(binaryFile), "debug", MessageBoxButtons.OK, MessageBoxIcon.Information); //this also shows question marks
I expect a few more caveats along the way with second functionality of the app (unpacking back into text and binary), but so far my main problem is unrecognized characters during reading of the binary file or converting it into string, which makes this data unusable in storing as text for purpose of reproducing the file. Any help would be appreciated.
There is no universal conversion of binary data to a string. A string is a series of Unicode characters and as such can hold any character in the Unicode range.
Binary data is a series of bytes and as such can be anything from video to a string in various formats.
Since there are multiple binary representations of a string, you need an Encoding to convert one into the other, and the encoding you choose has to match the binary format. If it doesn't, you will get the wrong result.
You are using the ASCII encoding for the conversion, which is obviously incorrect: ASCII cannot encode the full Unicode range. That means even if you use it for encoding, the result of decoding will not always match the original text.
If you have both encoding and decoding under your control, use an Encoding that can do the full round trip, such as UTF8 or Unicode. If you didn't encode the string yourself, use the encoding that actually produced the bytes.
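Base64, which the question is already using, is the right tool here: it is a pure byte-to-text transform, so every byte value round-trips. The question marks come from the ASCII debug output, not from the Base64 step. A minimal sketch contrasting the two:

```csharp
using System;
using System.Linq;
using System.Text;

class Base64Demo
{
    static void Main()
    {
        // Sample bytes including values above 0x7F, like the "ýÿÿ" in the question.
        byte[] original = { 0x00, 0x7F, 0x80, 0xFD, 0xFF };

        // Base64 round-trips every byte exactly.
        string packed = Convert.ToBase64String(original);
        byte[] restored = Convert.FromBase64String(packed);
        Console.WriteLine(restored.SequenceEqual(original)); // True

        // ASCII decoding, by contrast, replaces every byte above 0x7F with '?'.
        Console.WriteLine(Encoding.ASCII.GetString(original)); // last three become '?'
    }
}
```

So store the Base64 string in the joined text file and decode it with Convert.FromBase64String when unpacking; don't pass the bytes through Encoding.ASCII at any point.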
Below is what the text looks like when viewed in Notepad++.
I need to get the IndexOf for that piece of the string, for use in the code below, and I can't figure out how to use the odd characters in my code.
int start = text.IndexOf("AppxxxxxxDB INFO");
Where the "xxxxx"s represent the strange characters.
All these characters have corresponding ASCII codes; you can insert them in a string by escaping them.
For instance:
"App\x0000\x0001\x0000\x0003\x0000\x0000\x0000DB INFO"
or shorter:
"App\x00\x01\x00\x03\x00\x00\x00"+"DB INFO"
\xXXXX specifies one character, where XXXX is the hexadecimal number of that character. Note that in C# the \x escape greedily consumes one to four hex digits, which is why the shorter form splits the literal before "DB INFO": written as one literal, "\x00D" would be read as the single character U+000D.
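A quick sketch of both forms and of the greedy-escape pitfall that makes the string concatenation necessary:

```csharp
using System;

class HexEscapeDemo
{
    static void Main()
    {
        // The four-digit form is unambiguous:
        string a = "App\x0000\x0001\x0000\x0003\x0000\x0000\x0000DB INFO";

        // The short form must split the literal, because \x is greedy:
        string b = "App\x00\x01\x00\x03\x00\x00\x00" + "DB INFO";
        Console.WriteLine(a == b); // True

        // Without the split, "\x00D" would fold the 'D' into the escape:
        Console.WriteLine("\x00DB".Length); // 1 -- a single char, U+00DB
    }
}
```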
Notepad++ simply tries to make these characters a bit more visible by rendering their abbreviations in a "bubble". But that's just rendering.
These characters originated as directives for printers (and other devices): for instance, instructing a printer to move to the next line or to stop a print job; they are still used today. Some terminals use them to communicate color changes, etc. The best known is \n, or \x000A, which starts a new line. For text, they are thus characters that specify how to handle text, loosely comparable to modern HTML (though the equivalence is limited). \n is a new line only because there is consensus about that; anyone defining their own encoding could invent a new system.
Echoing @JonSkeet's warning: when you read a file into a string, the file's bytes are decoded according to a character-set encoding. The decoder has to do something with byte values or sequences that are invalid per the encoding's rules; typical decoders substitute a replacement character and attempt to go on.
I call that data corruption. In most cases, I'd rather have the decoder throw an exception.
You can use a standard decoder, customize one or create a new one with the Encoding class to get the behavior you want. Or, you can preserve the original bytes by reading the file as bytes instead of as text.
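For the "rather have an exception" behaviour, a minimal sketch using the UTF8Encoding constructor's throwOnInvalidBytes flag:

```csharp
using System;
using System.Text;

class StrictDecoderDemo
{
    static void Main()
    {
        // throwOnInvalidBytes: true replaces the silent U+FFFD substitution
        // with a DecoderFallbackException, so corruption cannot pass unnoticed.
        var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                                          throwOnInvalidBytes: true);
        try
        {
            strictUtf8.GetString(new byte[] { 150 }); // a lone invalid byte
        }
        catch (DecoderFallbackException)
        {
            Console.WriteLine("invalid input detected, not silently corrupted");
        }
    }
}
```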
If you insist on reading the file as text, I suggest the 437 encoding because it has 256 characters, one for every byte value, with no restrictions on byte sequences, and every 437 character is also in Unicode. The bytes that represent text will possibly decode to the same characters you want to search for as strings, but you have to check, comparing 437 and Unicode in this table.
Really, you should have and follow the specification for the file type you are reading. After all, there is no text but encoded text, and you have to know which encoding it is.
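A quick check of the code-page-437 round-trip claim (assuming the CodePages provider is registered on .NET Core / .NET 5+):

```csharp
using System;
using System.Linq;
using System.Text;

class Cp437Demo
{
    static void Main()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); // .NET Core / 5+
        var cp437 = Encoding.GetEncoding(437);

        // Every possible byte value, 0x00 through 0xFF.
        byte[] all = Enumerable.Range(0, 256).Select(i => (byte)i).ToArray();

        string text = cp437.GetString(all);  // every byte decodes to one char
        byte[] back = cp437.GetBytes(text);  // and encodes back unchanged

        Console.WriteLine(text.Length);             // 256
        Console.WriteLine(back.SequenceEqual(all)); // True
    }
}
```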
I read a string using the windows-1256 encoding, but the numbers in that string are encoded with UTF-8. As a result, all of the text except the numbers reads correctly, while the numbers display as "?". I want to know how I can read the complete text without problems: how can I know when to switch between encodings to read the correct text?
NOTE: Browsers display this kind of text correctly, so they know when to switch.
Any solution or code?
The lower half of the windows-1256 code page is the same as ASCII, and digits in UTF-8 are also the same as ASCII; if you read the string with the windows-1256 encoding, it should work just fine.
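A minimal sketch confirming that digit bytes decode identically under both encodings, so no switching is needed for the numbers:

```csharp
using System;
using System.Text;

class DigitsDemo
{
    static void Main()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); // .NET Core / 5+

        byte[] digits = { 0x30, 0x31, 0x32 }; // "012" in ASCII

        // ASCII, UTF-8 and windows-1256 all agree on bytes 0x00-0x7F.
        Console.WriteLine(Encoding.UTF8.GetString(digits));                        // 012
        Console.WriteLine(Encoding.GetEncoding("windows-1256").GetString(digits)); // 012
    }
}
```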
I asked this question a day ago regarding Greek Unicode characters, and now I have a question which builds upon that one.
After extracting all my data, I have attempted to prepare it for import into Excel. I had to choose a tab-delimited file because some of my data contains commas (lucky me!).
The issue I'm running into is a very weird character after I import the data into Excel.
The column data in Notepad++ looks like this:
Total Suspended Solids #105°C
The Excel cell data looks like this:
Total Suspended Solids #105Â°C
I don't understand why this is happening. Does this have something to do with how the degrees symbol is represented?
P.S. The symbols in this question are a direct copy and paste.
(More likely) Excel is interpreting your textual data as Latin-1 or windows-1252, and not UTF-8. "Â°" is what you get if you take the UTF-8 bytes for "°" (0xC2 0xB0) and interpret each byte as a character of Latin-1 or windows-1252. Is there an option for the input encoding when you do your import?
(Less likely) Excel is doing the right thing, but you're double-encoding your data (encoding as UTF-8, then re-interpreting it as an 8-bit encoding and encoding again as UTF-8 or another Unicode encoding). The Notepad++ evidence is against this one.
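The "more likely" mechanism can be reproduced in a few lines: encode the degree sign as UTF-8, then decode those bytes as windows-1252.

```csharp
using System;
using System.Text;

class MojibakeDemo
{
    static void Main()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); // .NET Core / 5+

        // "°" (U+00B0) encoded as UTF-8 is two bytes: 0xC2 0xB0.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes("\u00B0");
        Console.WriteLine(BitConverter.ToString(utf8Bytes)); // C2-B0

        // Reading those two bytes as windows-1252 yields two characters: "Â°".
        string misread = Encoding.GetEncoding(1252).GetString(utf8Bytes);
        Console.WriteLine(misread == "\u00C2\u00B0"); // True
    }
}
```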
I'm not absolutely sure, but I think Excel expects Windows-1252 character encoding, so make sure you create your text file using Encoding.GetEncoding("Windows-1252").
For example:
using (var writer = new StreamWriter(fileName, false, Encoding.GetEncoding("Windows-1252")))
{
....
}
You can also write the file with a UTF-8 byte order mark (BOM), which lets Excel detect the encoding.
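A minimal sketch of the BOM approach (the path here is just an example):

```csharp
using System;
using System.IO;
using System.Text;

class BomDemo
{
    static void Main()
    {
        string path = Path.Combine(Path.GetTempPath(), "bom-demo.txt"); // example path

        // UTF8Encoding(true) emits the byte order mark EF BB BF at the
        // start of the file, which programs such as Excel use to detect UTF-8.
        var utf8WithBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true);
        File.WriteAllText(path, "Total Suspended Solids #105\u00B0C", utf8WithBom);

        byte[] bytes = File.ReadAllBytes(path);
        Console.WriteLine(bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF); // True

        File.Delete(path);
    }
}
```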