The same text has different byte size - c#

I have simple txt file, read this content, do with them some operation (for example encode and decode) and save result in file. When I compare its two files in beyond compare I see that content the same. But sizes of files are different. Why? And how I can resolve this problem?

There can be many reasons for that, for example different encoding, or maybe one is using \r\n and the other uses only \n.
Use Hex Compare in BeyondCompare to find out exactly.

Related

Properly editing a text or source file from C#/.NET

The typical way to edit a text or source file from code is to read the file using File.ReadAllLines or File.ReadAllText, make your changes, and then write it out using WriteAllLines or WriteAllText.
However, if you were to open the text file (say some source code file) in Visual Studio or Notepad++, scroll down a few lines, make a change, and save, a lot more is handled.
It seems that what is handled, at least on Windows, is a complicated set of rules and heuristics that takes into account, at a minimum:
The inferred encoding of the text file.
The line-endings
Whether the last line is an "incomplete line" (as described in diffutils/manual, namely a line with no line ending character(s)
I'll discuss these partially just to illustrate the complexity. My question is, is there a full set of heuristics, an already established algorithm that can be used or an existing component that encapsulates this.
Inferred Encoding
Most common for source / text files:
UTF-16 with BOM
UTF-8 with BOM
UTF-8 without BOM
When there's no BOM, the encoding is inferred using some heuristics.
It could be ASCII or Windows1252 (Encoding.GetEncoding(1252)), or BOMless UTF-8
It depends on what the rest of the data looks like. If there's some known upper-ascii or what might look UTF-8.
When you save, you need to keep the same encoding.
Line endings
You have to keep the same line-endings. So if the file uses CR/LF, then keep it at CR/LF.
But when it's just LF, then keep that.
But it can get more complicated then that as given text file may have both, and one would need to maintain that as well.
For example, a source file that's CR/LF may, inside of it, have a section that's only LF line-ended only.
This can happen when someone pastes text from another tool into a literal multi-line string, such as C#'s #"" strings.
Visual Studio handles this correctly.
Incomplete lines
If the last line is incomplete, that has to be maintained as well. That means, if the last line doesn't end with end-of-line character(s)
Possible approach
I think one way to get around all of these problems from the start is to treat the file as binary instead of text. This means the normal text-file processing in .NET cannot be used. A new set of APIs will be needed to handle editing such files.
I can imaging a component that requires you to open the file as a memory stream and pass that to the component. The component then can read the stream and provide a line-oriented view to clients, such that client code can iterate over the lines for processing. Each element through the iteration will be an object of a type that looks something like this:
class LineElement
{
int originalLineNumber;
string[] lines;
string[] lineEndings;
}
As an example for a normal text file on Windows:
originalLineNumber will be 1
lines will be a one-dimensional array with the first line of the file, without line-endings
lineEndings[0] will be "\x0D\x0A"
the lines field can be modified. It can be replaced with empty array to delete the line
or it can be replaced with with a multi-element array to insert lines (replacing the existing line)
lineEndings array handled similarly.
In many cases, new lines aren't removed or inserted, in which case the application code never has to deal with line-endings at all. They simply operate on the lines[] array, ignoring the lineEndings[] array.
I'm open to other suggestions.

Get only strings from binary file

I try to print to screen a string from a binary file using xaml labels, but when i display the file content I got a beautiful "corrupted" character instead of the entire file content.
I think the problem is reading the file, I already can change the label content using the most basic technique it work pretty well till today....
label.Text = mystring ;
The fact is : I have data in my binaries files that inst text (some random data that I don't care) located to the start of the file, my theory is my program start reading, read a non ascii character and stop reading...
I read using the File class, maybe the wrong thing.....
label.Text = File.ReadAllText(my_file);
So, im lock now. I don't exactly know what im supposed to do....
Hope you can help me :D
I can't tell much without looking at the text, but it seems you need to add the Encoding
Something like this:
string myText = File.ReadAllText(path, Encoding.Default);
You need to know how your binary file is structured. You need to know the encoding of the strings. A normal text file normally has markers at the beginning two or so bytes that identify its encoding if it is Unicode. This way the system can know whether its UTF-8, UTF-16, ...
If you try to read a binary file this Information is not there. Instead the reading process will most probably find unexpected binary data. So you cannot read a binary file as text. If your file is structured the way that at the beginning is binary data and later only text, just skip the first part and start reading at the start of the second part. But I don't think, that it is that easy:
if it really is binary data, chances are that the file structure is much more complicated and you need to do more work to read it.
if only the first two bytes are binary data, then maybe its a text file and you can read it without problems, you maybe only need to pass the right encoding to the reading function

C# Unrecognized characters while reading from binary file

I have some items who's information is split into two parts, one is contents of a binary file, and other is textual entry inside .txt file. I am trying to make an app that will pack this info into one textual file (textual file because I have reasons to want this file to also be humanly readable as well), with ability to later unpack that file back by creating new binary file and text entry.
The first problem I ran into so far: some info is lost when converting binary into string (or perhaps sooner, during reading of bytes), and I'm not sure if the file is in weird format or I'm doing something wrong. Some characters get shown as question marks.
Example of characters which are replaced with question marks:
ýÿÿ
This is the part where info is read from the binary file and gets encoded into a string (which is how I inteded to store it inside a text file).
byte[] binaryFile = File.ReadAllBytes(pathBinary);
// I also tried this for some reason: byte[] binaryFile = Encoding.ASCII.GetBytes(File.ReadAllText(pathBinary));
string binaryFileText = Convert.ToBase64String(binaryFile); //this is the coded string that goes into joined file to hold binary file information, when decoded the result shows question marks instead of some characters
MessageBox.Show("binary file text: " + Encoding.ASCII.GetString(binaryFile), "debug", MessageBoxButtons.OK, MessageBoxIcon.Information); //this also shows question marks
I expect a few more caveats along the way with second functionality of the app (unpacking back into text and binary), but so far my main problem is unrecognized characters during reading of the binary file or converting it into string, which makes this data unusable in storing as text for purpose of reproducing the file. Any help would be appreciated.
There is no universal conversion of binary string data to a string. A string is a series of unicode characters and as such can hold any character of the unicode range.
Binary data is a series of bytes and as such can be anything from video to a string in various formats.
Since there are multiple binary string representations, you need an Encoding to convert one into the other. The encoding you choose has to match the binary string format. If it doesn't you will get the wrong result.
You are using ASCII encoding for the conversion, which is obviously incorrect. ASCII can not encode the full unicode range. That means even if you use it for encoding, the result of the decoding will not always match the original text.
If you have both, encoding and decoding under control, use an Encoding that can do the full round trip, such as UTF8 or Unicode. If you don't encode the string yourself, use the correct Encoding.

What is ¿ in Text Files

I am working on a C# program which processes a bunch of text files. These files have been created by a system so I can't change the source, but within the file ¿ appears multiple times which is causing my code to fall over.
What does ¿ mean and how can I handle it?
¿ means you have a character that is converted from another encoding type and is not recognized in the character table of your encoding type. You may handle it if you use another encoding type.
Documentation
At the start of Unicode-encoded files is a "header". This header tells programs reading it that it's a Unicode file. This is called a "Byte order mark" and signifies to readers what type of Unicode it is. http://msdn.microsoft.com/en-us/library/windows/desktop/dd374101(v=vs.85).aspx
To elaborate on my comment, first you should find out what encoding was used when these were created, then use that encoding when reading them in. Check out:
BinaryReader(Stream, Encoding)
http://msdn.microsoft.com/en-us/library/system.io.binaryreader.aspx

C# How can i remove newline characters from binary?

Basically i have binary data, i dont mind if it's unreadable but im writing it to a file which is parsed and so it's importance newline characters are taken out.
I thought i had done the right thing when i converted to string....
byte[] b = (byte[])SubKey.GetValue(v[i]);
s = System.Text.ASCIIEncoding.ASCII.GetString(b);
and then removed the newlines
String t = s.replace("\n","")
but its not working ?
Newline might be \r\n, and your binary data might not be ASCII encoded.
Firstly newline (Environment.Newline) is usually two characters on Windows, do you mean removing single carriage-return or line-feed characters?
Secondly, applying a text encoding to binary data is likely to lead to unexpected conversions. E.g. what will happen to buyes of the binary data that do not map to ASCII characters?
New line character may be \n or \r or \r\n depends on operating system type, in order this is markers for Linux, Macintosh and Windows.
But if you say you file is binary from what you know they have newlines in ASCII in her content?
If this is binary file this may be a some struct, if this they struct you after remove newline characters shift left all data after the this newline and corrupt data in her.
I would imagine removing the bytes in a binary chunk which correspond the line feeds would actually corrupt the binary data, thereby making it useless.
Perhaps you'd be better off using base64 encoding, which will produce ASCII-safe output.
If this is text data, then load it as text data (using the correct encoding), replace it as as a string, and re-encode it (using the correct encoding). For some encodings you might be able to do a swap at the file level (without decoding/encoding), but I wouldn't bet on it.
If this is any other binary representation, you will have to know the exact details. For example, it is common (but not for certain) for strings embedded in part of a binary file to have a length prefix. If you change the data without changing the length prefix, you've just corrupted the file. And to change the length prefix you need to know the format (it might be big-endian/little-endian, any fixed number of bytes, or the prefix itself could be variable length). Or it might be delimited. Or there might be relative offsets scattered through the file that all need fixing.
Just as likely; you could by chance have the same byte sequence in the binary that doesn't represent a newline; you could be completely trashing the data.

Categories