Get only strings from binary file - c#

I'm trying to display a string from a binary file in a XAML label, but when I show the file content I get a single beautiful "corrupted" character instead of the whole text.
I think the problem is in how I read the file. I can already change the label content using the most basic technique, and it worked fine until today:
label.Text = mystring;
The thing is, my binary files contain non-text data (random bytes I don't care about) at the start of the file. My theory is that my program starts reading, hits a non-ASCII character, and stops.
I read using the File class, which may be the wrong approach:
label.Text = File.ReadAllText(my_file);
So I'm stuck now. I don't know exactly what I'm supposed to do.
Hope you can help me :D

I can't tell much without looking at the file, but it seems you need to specify the encoding.
Something like this:
string myText = File.ReadAllText(path, Encoding.Default);

You need to know how your binary file is structured, and you need to know the encoding of the strings. A normal Unicode text file usually has a marker at the beginning, a byte order mark of two or more bytes, that identifies its encoding; that is how the system knows whether it's UTF-8, UTF-16, and so on.
If you try to read a binary file, this information is not there. Instead the reading process will most probably hit unexpected binary data, so you cannot read a binary file as text. If your file is structured so that binary data comes first and only text follows, you can skip the first part and start reading at the beginning of the second part. But I don't think it is that easy:
if it really is binary data, chances are that the file structure is much more complicated and you need to do more work to read it.
if only the first two bytes are binary data, then maybe it's a text file after all and you can read it without problems; you may only need to pass the right encoding to the reading function.
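If the file does turn out to be "binary header, then text", skipping the header could look like the sketch below. Both the header size and the UTF-8 encoding are assumptions; adjust them to your actual format:

```csharp
using System;
using System.IO;
using System.Text;

public static class HeaderSkipDemo
{
    // Reads the text portion of a file, skipping a binary header of known size.
    // headerSize and Encoding.UTF8 are assumptions -- adjust to your format.
    public static string ReadTextAfterHeader(string path, long headerSize)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            fs.Seek(headerSize, SeekOrigin.Begin);      // skip the binary part
            using (var reader = new StreamReader(fs, Encoding.UTF8))
            {
                return reader.ReadToEnd();              // read only the text part
            }
        }
    }

    public static void Main()
    {
        // Build a sample file: 4 junk bytes followed by UTF-8 text.
        var path = Path.GetTempFileName();
        var junk = new byte[] { 0xFF, 0x00, 0xFE, 0x01 };
        var text = Encoding.UTF8.GetBytes("hello from the text section");
        var all = new byte[junk.Length + text.Length];
        junk.CopyTo(all, 0);
        text.CopyTo(all, junk.Length);
        File.WriteAllBytes(path, all);

        Console.WriteLine(ReadTextAfterHeader(path, junk.Length));
        File.Delete(path);
    }
}
```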

Related

How can I write a custom encoder for the below text in C# .NET

I'm attempting to read a .txt file and place the text into an object, then later serialize that object and write it to another .txt, all while keeping exactly the same characters.
I've tried using 'iso-8859-1' encoding with File.ReadAllLines(), but the characters still come out wrong.
I've also tried creating a custom JavaScriptEncoder for serialization, but that did not work; I'm assuming because the read wasn't even getting the correct characters.
Is there a way I can write a custom encoder for both File.ReadAllLines() and JsonSerializer.Serialize() so that I can keep exactly the same characters throughout? Thanks
Edit: I removed the encoding entirely and it worked for most characters, but 'œ' still comes back as 'o'.
Original Text:
sfør
Är du säker på a
un¹æ ko
róciæ kolejnoœæ numeró
e¿y pamiêtaæ, ¿e w
aŸn
nieœ w górê
g³ówna
w³aœc
Ultimately, if you're going to read and write text, you need to know what encoding you're meant to be using; you cannot usually guess. There's not really any such thing as a "text file": there's just a binary file that your code translates to text via an encoding. Either the system can guess, or you can tell it. These days UTF-8 is a pragmatic default, and ANSI encodings such as iso-8859-1 should usually be considered legacy, reserved for handling data that is limited to that specific codepage for historic reasons.
So, either:
determine what encoding you're meant to be using and use that for both read and write, or
treat the data as raw bytes, without attempting to parse it into string (etc.) data
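For the first option, that means passing the same explicitly chosen encoding to both the read and the write. A minimal sketch using iso-8859-1 from the question (which codepage your data actually uses is an assumption you must verify; note in particular that 'œ' does not exist in iso-8859-1 at all, though it does in windows-1252, which likely explains the œ-to-o substitution mentioned in the edit):

```csharp
using System;
using System.IO;
using System.Text;

class EncodingRoundTrip
{
    static void Main()
    {
        // Assumption: the file really is iso-8859-1. Substitute the codepage
        // your data actually uses; e.g. 'œ' needs windows-1252, not iso-8859-1.
        var enc = Encoding.GetEncoding("iso-8859-1");

        string original = "sfør Är du säker på";
        var path = Path.GetTempFileName();

        File.WriteAllText(path, original, enc);            // write with explicit encoding
        string roundTripped = File.ReadAllText(path, enc); // read with the same encoding

        Console.WriteLine(roundTripped == original);       // True
        File.Delete(path);
    }
}
```

As long as every character fits in the chosen codepage and the same encoding is used on both sides, the text survives the round trip byte-for-byte.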

Properly editing a text or source file from C#/.NET

The typical way to edit a text or source file from code is to read the file using File.ReadAllLines or File.ReadAllText, make your changes, and then write it out using WriteAllLines or WriteAllText.
However, if you were to open the text file (say some source code file) in Visual Studio or Notepad++, scroll down a few lines, make a change, and save, a lot more is handled.
It seems that what is handled, at least on Windows, is a complicated set of rules and heuristics that takes into account, at a minimum:
The inferred encoding of the text file.
The line-endings
Whether the last line is an "incomplete line" (as described in the diffutils manual), namely a line with no line-ending character(s)
I'll discuss these partially just to illustrate the complexity. My question is, is there a full set of heuristics, an already established algorithm that can be used or an existing component that encapsulates this.
Inferred Encoding
Most common for source / text files:
UTF-16 with BOM
UTF-8 with BOM
UTF-8 without BOM
When there's no BOM, the encoding is inferred using some heuristics.
It could be ASCII or Windows1252 (Encoding.GetEncoding(1252)), or BOMless UTF-8
It depends on what the rest of the data looks like: whether there are upper-ASCII bytes, or byte sequences that might look like UTF-8.
When you save, you need to keep the same encoding.
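The BOM part of that inference can be sketched directly; this is roughly what StreamReader does when detectEncodingFromByteOrderMarks is true. A simplified sketch that ignores UTF-32 (a full version would also check FF FE 00 00 / 00 00 FE FF before the UTF-16 cases):

```csharp
using System;
using System.Text;

public static class BomSniffer
{
    // Returns the encoding implied by a BOM, or null if no BOM is present.
    // Without a BOM you are back to heuristics (ASCII vs 1252 vs BOM-less UTF-8).
    public static Encoding DetectBom(byte[] b)
    {
        if (b.Length >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
            return Encoding.UTF8;                       // UTF-8 BOM
        if (b.Length >= 2 && b[0] == 0xFF && b[1] == 0xFE)
            return Encoding.Unicode;                    // UTF-16 LE BOM
        if (b.Length >= 2 && b[0] == 0xFE && b[1] == 0xFF)
            return Encoding.BigEndianUnicode;           // UTF-16 BE BOM
        return null;                                    // no BOM: fall back to heuristics
    }
}
```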
Line endings
You have to keep the same line-endings. So if the file uses CR/LF, then keep it at CR/LF.
But when it's just LF, then keep that.
But it can get more complicated than that, as a given text file may contain both, and one would need to maintain that as well.
For example, a source file that's CR/LF may contain a section inside it that is LF-ended only.
This can happen when someone pastes text from another tool into a multi-line string literal, such as C#'s @"" verbatim strings.
Visual Studio handles this correctly.
Incomplete lines
If the last line is incomplete, that has to be maintained as well: if the last line doesn't end with end-of-line character(s), the saved file must not add them.
Possible approach
I think one way to get around all of these problems from the start is to treat the file as binary instead of text. This means the normal text-file processing in .NET cannot be used. A new set of APIs will be needed to handle editing such files.
I can imagine a component that requires you to open the file as a stream and pass that to the component. The component then reads the stream and provides a line-oriented view to clients, so that client code can iterate over the lines for processing. Each element of the iteration would be an object of a type that looks something like this:
class LineElement
{
    int originalLineNumber;
    string[] lines;
    string[] lineEndings;
}
As an example for a normal text file on Windows:
originalLineNumber will be 1
lines will be a one-dimensional array with the first line of the file, without line-endings
lineEndings[0] will be "\x0D\x0A"
the lines field can be modified: it can be replaced with an empty array to delete the line,
or it can be replaced with a multi-element array to insert lines (replacing the existing line)
the lineEndings array is handled similarly.
In many cases, new lines aren't removed or inserted, in which case the application code never has to deal with line-endings at all. They simply operate on the lines[] array, ignoring the lineEndings[] array.
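The parsing half of such a component could be sketched like this: split raw text into (line, ending) pairs so that mixed CR/LF and LF endings and an incomplete last line all round-trip exactly. The method name and tuple shape are illustrative, simplified from the LineElement above:

```csharp
using System;
using System.Collections.Generic;

public static class LinePreservingSplitter
{
    // Splits text into (content, ending) pairs. The ending is "", "\n",
    // "\r", or "\r\n", so concatenating all parts reproduces the input
    // exactly -- including a final incomplete line with no ending.
    public static List<(string Line, string Ending)> Split(string text)
    {
        var result = new List<(string, string)>();
        int start = 0, i = 0;
        while (i < text.Length)
        {
            char c = text[i];
            if (c == '\n' || c == '\r')
            {
                string ending = (c == '\r' && i + 1 < text.Length && text[i + 1] == '\n')
                    ? "\r\n" : c.ToString();
                result.Add((text.Substring(start, i - start), ending));
                i += ending.Length;
                start = i;
            }
            else i++;
        }
        if (start < text.Length)                       // incomplete last line
            result.Add((text.Substring(start), ""));
        return result;
    }
}
```

Because each line carries its own ending, a CR/LF file with an embedded LF-only section is written back exactly as it was read.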
I'm open to other suggestions.

Replacing a word in a text file

I'm making a little program where the data saved for some users is stored in a text file. I'm using System.IO with the StreamWriter to write new information to my text file.
The text in the file is formatted like so :
name1, 1000, 387
name2, 2500, 144
... and so on. I'm using infos = line.Split(',') to split the values into an array that is more useful for searching. I use a while loop to find the correct line (where the name matches) and read the number of points from infos[1].
I'd like to modify this infos[1] value and set it to something else. I'm trying to find a way to replace a word in C#, but I can't find a good one. From what I've read, there is no way to replace a single word; you have to rewrite the complete file.
Is there a way to delete a line completely, so that I could rewrite it at the end of the text file and not have to worry about it being duplicated?
I tried using the Replace method, but it didn't work. I'm a bit lost looking at the answers proposed for similar problems, so I would really appreciate it if someone could explain my options.
If I understand you correctly, you can use the File.ReadLines method and LINQ to accomplish this. First, get the line you want:
var line = File.ReadLines("path")
    .FirstOrDefault(x => x.StartsWith("name1 or whatever"));
if (line != null)
{
    /* change the line */
}
Then write the new line to your file excluding the old line:
var lines = File.ReadLines("path")
    .Where(x => !x.StartsWith("name1 or whatever"))
    .ToList(); // materialize first: File.ReadLines is lazy, and the file
               // must be closed before WriteAllLines reopens it
var newLines = lines.Concat(new[] { line });
File.WriteAllLines("path", newLines);
The concept you are looking for is called random access for file reading/writing. Most of the easy-to-use I/O methods in C# are sequential-access: you read a chunk or a line and move forward to the next.
However, what you want to do is possible; you just need to read up on file streams. Here is a related SO question: .NET C# - Random access in text files - no easy way?
You are probably either reading the whole file, or reading it line by line as part of your search. If your fields are fixed-length, you can read a fixed number of bytes, track Stream.Position as you read, and work out how many characters you need to replace; then open the file for writing, seek to that exact position in the stream, and write the new value.
It's a bit complex if you are new to streams. If your file is not huge, copying a file line for line can be done pretty efficiently by the System.IO library if coded correctly, so you might just follow your second suggestion which is read the file line-for-line, write it to a new Stream (memory, temp file, whatever), replace the line in question when you get to that value, and when done, replace the original.
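A sketch of that second, line-for-line approach, assuming the comma-separated "name, points, score" format from the question (matching on the exact name in the first field is an assumption):

```csharp
using System;
using System.IO;

public static class PointsUpdater
{
    // Rewrites the file, replacing the points (second field) for one name.
    // Streams line by line to a temp file, then swaps it for the original.
    public static void SetPoints(string path, string name, string newPoints)
    {
        string temp = path + ".tmp";
        using (var reader = new StreamReader(path))
        using (var writer = new StreamWriter(temp))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                var infos = line.Split(',');
                if (infos[0].Trim() == name)
                    infos[1] = " " + newPoints;        // update the points field
                writer.WriteLine(string.Join(",", infos));
            }
        }
        File.Delete(path);
        File.Move(temp, path);                         // replace the original
    }
}
```

Because the line is rewritten in place in the output stream, there is no duplication problem and no need to delete and re-append lines.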
It is most likely that you are new to C# and don't realize that strings are immutable (a fancy way of saying you can't change them). You can only create new strings by modifying the old:
String MyString = "abc 123 xyz";
MyString.Replace("123", "999"); // does not work
MyString = MyString.Replace("123", "999"); // works
[Edit:]
If I understand your follow-up question, you could do this:
infos[1] = infos[1].Replace("1000", "1500");

C# Reading files and encoding issue

I've searched everywhere for this answer, so hopefully it's not a duplicate; I've finally decided to just ask it here.
I have a file named Program1.exe. When I drag that file into Notepad or Notepad++, I see all kinds of random symbols and then some readable text. However, when I try to read this file in C#, I either get inaccurate results or just a big "MZ". I've tried all the encodings C# supports. How can the Notepad programs read a file like this when I can't? Converting bytes to string doesn't work, reading line by line doesn't work, and even reading it as binary doesn't work.
Thanks for the help! :)
Reading a binary file as text is a peculiar thing to do, but it is possible. Any of the 8-bit encodings will do it just fine. For example, the code below opens and reads an executable and outputs it to the console.
const string fname = @"C:\mystuff\program.exe";
using (var sw = new StreamReader(fname, Encoding.GetEncoding("windows-1252")))
{
    var s = sw.ReadToEnd();
    s = s.Replace('\x0', ' '); // replace NUL bytes with spaces
    Console.WriteLine(s);
}
The result is very similar to what you'll see in Notepad or Notepad++. The "funny symbols" will differ based on how your console is configured, but you get the idea.
By the way, if you examine the string in the debugger, you're going to see something quite different. Those funny symbols are encoded as C# character escapes. For example, nul bytes (value 0) will display as \0 in the debugger, as NUL in Notepad++, and as spaces on the console or in Notepad. Newlines show up as \r in the debugger, etc.
As I said, reading a binary file as text is pretty peculiar. Unless you're just looking to see if there's human-readable data in the file, I can't imagine why you'd want to do this.
Update
I suspect the reason that all you see in the Windows Forms TextBox is "MZ" is that the underlying Windows textbox control (which is what the TextBox ultimately uses) treats the NUL character as a string terminator, so it won't display anything after the first NUL. And the first thing after the "MZ" is a NUL (shown as \0 in the debugger). You'll have to replace the 0's in the string with spaces; I edited the code example above to show how.
The exe is a binary file and if you try to read it as a text file you'll get the effect that you are describing. Try using something like a FileStream instead that does not care about the structure of the file but treats it just as a series of bytes.
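Building on that suggestion, and on the title of this whole page, a sketch that reads the raw bytes and keeps only runs of printable ASCII, similar to what the Unix strings tool does (the minimum run length of 4 is an arbitrary choice):

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class StringsExtractor
{
    // Returns runs of at least minLen printable ASCII characters,
    // similar to the Unix `strings` tool. Use File.ReadAllBytes(path)
    // to get the data from a file.
    public static List<string> ExtractStrings(byte[] data, int minLen = 4)
    {
        var results = new List<string>();
        var current = new StringBuilder();
        foreach (byte b in data)
        {
            if (b >= 0x20 && b <= 0x7E)                // printable ASCII
                current.Append((char)b);
            else
            {
                if (current.Length >= minLen)
                    results.Add(current.ToString());
                current.Clear();                       // run broken by a binary byte
            }
        }
        if (current.Length >= minLen)
            results.Add(current.ToString());
        return results;
    }
}
```

This sidesteps encodings entirely: it never asks what the binary bytes "mean", it just harvests the readable fragments between them.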

The same text has different byte size

I have a simple .txt file. I read its content, perform some operations on it (for example, encode and decode), and save the result to a file. When I compare these two files in Beyond Compare, I see that the content is the same, but the sizes of the files are different. Why, and how can I resolve this?
There can be many reasons for that: for example, different encodings, or maybe one file uses \r\n and the other uses only \n.
Use Hex Compare in Beyond Compare to find out exactly.
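Both causes are easy to demonstrate: the same five characters produce different byte counts depending on encoding, BOM, and line endings (a small sketch):

```csharp
using System;
using System.IO;
using System.Text;

class ByteSizeDemo
{
    static void Main()
    {
        string text = "ab\ncd";                        // 5 characters

        byte[] utf8NoBom = new UTF8Encoding(false).GetBytes(text);
        byte[] bom       = Encoding.UTF8.GetPreamble();            // 3-byte UTF-8 BOM
        byte[] utf16     = Encoding.Unicode.GetBytes(text);        // 2 bytes per char
        byte[] crlf      = new UTF8Encoding(false).GetBytes(text.Replace("\n", "\r\n"));

        Console.WriteLine(utf8NoBom.Length);               // 5  (plain UTF-8)
        Console.WriteLine(bom.Length + utf8NoBom.Length);  // 8  (UTF-8 with BOM)
        Console.WriteLine(utf16.Length);                   // 10 (UTF-16)
        Console.WriteLine(crlf.Length);                    // 6  (\n became \r\n)
    }
}
```

Identical text on screen, four different file sizes on disk; the hex view makes the cause obvious.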
