C# Reading files and encoding issue

C# Reading files and encoding issue - c#

I've searched everywhere for this answer so hopefully it's not a duplicate. I decided I'm just finally going to ask it here.
I have a file named Program1.exe When I drag that file into Notepad or Notepad++ I get all kinds of random symbols and then some readable text. However, when I try to read this file in C#, I either get inaccurate results, or just a big MZ. I've tried all supported encodings in C#. How can notepad programs read a file like this but I simply can't? I try to convert bytes to string and it doesn't work. I try to directly read line by line and it doesn't work. I've even tried binary and it doesn't work.
Thanks for the help! :)

Reading a binary file as text is a peculiar thing to do, but it is possible. Any of the 8-bit encodings will do it just fine. For example, the code below opens and reads an executable and outputs it to the console.
const string fname = #"C:\mystuff\program.exe";
using (var sw = new StreamReader(fname, Encoding.GetEncoding("windows-1252")))
{
var s = sw.ReadToEnd();
s = s.Replace('\x0', ' '); // replace NUL bytes with spaces
Console.WriteLine(s);
}
The result is very similar to what you'll see in Notepad or Notepad++. The "funny symbols" will differ based on how your console is configured, but you get the idea.
By the way, if you examine the string in the debugger, you're going to see something quite different. Those funny symbols are encoded as C# character escapes. For example, nul bytes (value 0) will display as \0 in the debugger, as NUL in Notepad++, and as spaces on the console or in Notepad. Newlines show up as \r in the debugger, etc.
As I said, reading a binary file as text is pretty peculiar. Unless you're just looking to see if there's human-readable data in the file, I can't imagine why you'd want to do this.
Update
I suspect the reason that all you see in the Windows Forms TextBox is "MZ" is that the Windows textbox control (which is what the TextBox ultimately uses), uses the NUL character as a string terminator, so won't display anything after the first NUL. And the first thing after the "MZ" is a NUL (shows as `\0' in the debugger). You'll have to replace the 0's in the string with spaces. I edited the code example above showing how you'd do that.

The exe is a binary file and if you try to read it as a text file you'll get the effect that you are describing. Try using something like a FileStream instead that does not care about the structure of the file but treats it just as a series of bytes.

Related

Get only strings from binary file

I try to print to screen a string from a binary file using xaml labels, but when i display the file content I got a beautiful "corrupted" character instead of the entire file content.
I think the problem is reading the file, I already can change the label content using the most basic technique it work pretty well till today....
label.Text = mystring ;
The fact is : I have data in my binaries files that inst text (some random data that I don't care) located to the start of the file, my theory is my program start reading, read a non ascii character and stop reading...
I read using the File class, maybe the wrong thing.....
label.Text = File.ReadAllText(my_file);
So, im lock now. I don't exactly know what im supposed to do....
Hope you can help me :D

I can't tell much without looking at the text, but it seems you need to add the Encoding
Something like this:
string myText = File.ReadAllText(path, Encoding.Default);

You need to know how your binary file is structured. You need to know the encoding of the strings. A normal text file normally has markers at the beginning two or so bytes that identify its encoding if it is Unicode. This way the system can know whether its UTF-8, UTF-16, ...
If you try to read a binary file this Information is not there. Instead the reading process will most probably find unexpected binary data. So you cannot read a binary file as text. If your file is structured the way that at the beginning is binary data and later only text, just skip the first part and start reading at the start of the second part. But I don't think, that it is that easy:
if it really is binary data, chances are that the file structure is much more complicated and you need to do more work to read it.
if only the first two bytes are binary data, then maybe its a text file and you can read it without problems, you maybe only need to pass the right encoding to the reading function

Can not read turkish characters from text file to string array

I am trying to do some kind of sentence processing in turkish, and I am using text file for database. But I can not read turkish characters from text file, because of that I can not process the data correctly.
string[] Tempdatabase = File.ReadAllLines(#"C:\Users\dialogs.txt");
textBox1.Text = Tempdatabase[5];
Output:

It's probably an encoding issue. Try using one of the Turkish code page identifiers.
var Tempdatabase =
File.ReadAllLines(#"C:\Users\dialogs.txt", Encoding.GetEncoding("iso-8859-9"));

You can fiddle around using Encoding as much as you like. This might eventually yield the expected result, but bear in mind that this may not work with other files.
Usually, C# processes strings and files using Unicode by default. So unless you really need something else, you should try this instead:
Open your text file in notepad (or any other program) and save it as an UTF-8 file. Then, you should get the expected results without any modifications in your code. This is because C# reads the file using the encoding you saved it with. This is default behavior, which should be preferred.
When you save your text file as UTF-8, then C# will interpret it as such.
This also applies to .html files inside Visual Studio, if you notice that they are displayed incorrectly (parsed with ASCII)

The file contains the text in a specific Turkish character set, not Unicode. If you don't specify any other behaviour, .net will assume Unicode text when reading text from a text file. You have two possible solutions:
Either change the text file to use Unicode (for example utf8) using an external text editor.
Or specify a specific character set to read for example:
string[] Tempdatabase = File.ReadAllLines(#"C:\Users\dialogs.txt", Encoding.Default);
This will use the local character set of the Windows system.
string[] Tempdatabase = File.ReadAllLines(#"C:\Users\dialogs.txt", Encoding.GetEncoding("Windows-1254");
This will use the Turkish character set defined by Microsoft.

Print a string in Hex representation or print control characters as literals

I have an application which uses clipboard. Users copy data from excel and paste it into my application.
However, sometimes my application cannot handle the data correctly (in production). But I cannot reproduce the problem. Because the spreadsheet user is using contains his own macro and I am not allowed to have his original spreadsheet, I am thinking to add code to dump the text in the clipboard completely to a log file, so that I can see if the data being pasted is in the correct format.
I am especially concerned with the control characters contained in the clipboard. Therefore I want to dump the string as literal. For example, if string in the clipboard contains '\t', I want to see "\t" in the log file, instead of a tab. Is there a way to do this?
Alternatively I can log the text in the hex format. But it seems I need to first convert the string to a char[] and use System.Buffer.BlockCopy to copy the char[] to a byte[]. And then print the byte[] using BitConverter.ToString(). It works fine. But is there a better way (easier, no need to copy, no need to loop through every character) to do this? .Net does not have a build-in function for this?
I used this as reference: byte[] to hex string
Methods suggested there look a little bit "heavy".

Simple Try this
"some text \t some text".Replace("\t",#"\t");

How to Handle Accented Characters in a Directory Name

I have a problem with using Directory.Exists() on a string that contains an accented character.
This is the directory path: D:\ést_test\scenery. It is coming in as a simple string in a file that I am parsing:
[Area.121]
Title=ést_test
local=D:\AITests\ést_test
Layer=121
Active=FALSE
Required=FALSE
My code is taking the local value and adding \scenery to it. I need to test that this exists (which it does) and am simply using:
if (!Directory.Exists(area.Path))
{
// some handling code
area.AreaIsValid = false;
}
This returns false. It seems that the string handling that I am doing is replacing the accented character. The text visualizer in VS2012 is showing this (directoryManager is just a wrap around System.IO.Directory):
And the warning message as displayed is showing this:
So it seems that the accented character is not being recognized. Searching for this issue does turn up but mostly about removing or replacing the accented character. I am currently using 'normal' string handling. I tried using FileInfo but the path seems to get mangled anyway.
So my first question is how do I get the path stored into a string so that it will pass the Directory.Exists test?
This raises a wider question of non latin characters in path names. I have users all over the world so I can see arabic. Russian, Chinese and so on in paths. How can I handle all of these?

The problem is almost certainly that you're loading the file with the wrong encoding. The fact that it's a filename is irrelevant - the screenshots show that you've lost the relevant data before you call Directory.Exists.
You should make sure you know the file encoding (e.g. UTF-8, Cp1252 etc) and then pass that in as an argument into however you're loading the file (e.g. File.ReadAllText). If this isn't enough information to get you going, you'll need to tell us more about the file (to work out what encoding it's in) and more about your code (how you're reading it).
Once you've managed to load the correct data, I'd hope that the file aspect just handles itself automatically.

Error reading a from file - Encoding issue

Im reading a CSV file that was created from MS Excel. When I open it up in notepad it looks ok, but in Notepad++ I change the Encoding from ANSI to UTF8 and a few non printed characters turn up.
Specifically xFF. -(HEX Value)
In my C# app this character is causing an issue when reading the file so is there a way I can do a String.replace('xFF', ' '); on this?
Update
I found this link on SO, as it turns out it is the answer to my question but not my problem.
Link

Instead of String.Replace, Specify encoding while reading the file.
Example
File.ReadAllText("test.csv",System.Text.UTF8Encoding)

Guess your unicode representation is wrong. Try this
string foo = "foo\xff";
foo.Replace('\xff',' ');

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

C# Reading files and encoding issue - c#

The exe is a binary file and if you try to read it as a text file you'll get the effect that you are describing. Try using something like a FileStream instead that does not care about the structure of the file but treats it just as a series of bytes.

Related

Get only strings from binary file

Can not read turkish characters from text file to string array

Print a string in Hex representation or print control characters as literals

How to Handle Accented Characters in a Directory Name

Error reading a from file - Encoding issue

Categories

Resources