From the MSDN pages for StreamWriter and BinaryWriter you can clearly see the differences:
StreamWriter:
Implements a TextWriter for writing characters to a stream in a particular encoding.
And:
BinaryWriter:
Writes primitive types in binary to a stream and supports writing strings in a specific encoding.
I know that the most important feature of BinaryWriter is that it can write primitive types in binary, which can be more compact and efficient (consider writing the integer 23861398: the binary writer would need 4 bytes, but the stream writer would need 8, 16, or even 32 bytes depending on the encoding).
But when it comes to writing strings, can we say StreamWriter and BinaryWriter are interchangeable?
From the BinaryWriter documentation:
Writes a length-prefixed string to this stream in the current encoding of the BinaryWriter, and advances the current position of the stream in accordance with the encoding used and the specific characters being written to the stream.
and:
Length-prefixed means that this method first writes the length of the string, in bytes, when encoded with the BinaryWriter instance's current encoding to the stream. This value is written as an unsigned integer. This method then writes that many bytes to the stream.
For example, the string "A" has a length of 1, but when encoded with UTF-16 the length is 2 bytes, so the value written in the prefix is 2, and 3 bytes are written to the stream, including the prefix.
The StreamWriter class does not write any string lengths to the output; it just writes the text itself.
Both do respect the encoding you have specified when you create the object. The actual text being written uses that encoding.
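As a minimal sketch of that difference (the in-memory streams and UTF-8 here are used purely for inspection; they are not part of the original question):

using System;
using System.IO;
using System.Text;

class ByteComparison
{
    static void Main()
    {
        // BinaryWriter: length prefix followed by the encoded characters.
        using (var ms = new MemoryStream())
        {
            using (var bw = new BinaryWriter(ms, Encoding.UTF8, leaveOpen: true))
            {
                bw.Write("A");
            }
            Console.WriteLine(BitConverter.ToString(ms.ToArray())); // 01-41
        }

        // StreamWriter: just the encoded characters, no prefix.
        using (var ms = new MemoryStream())
        {
            using (var sw = new StreamWriter(ms, new UTF8Encoding(false), 1024, leaveOpen: true))
            {
                sw.Write("A");
            }
            Console.WriteLine(BitConverter.ToString(ms.ToArray())); // 41
        }
    }
}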
So depending on what you mean by "interchangeable", they either are or they are not. I would say they are not, but if all you are looking at is the sequence of bytes that represents the written text itself, one might consider them to be.
To address your comment:
I'm new to BinaryWriter, so why does it need to write the length to the stream first? Why not just write the "real data" directly?
Unlike StreamWriter, which only writes one kind of data, BinaryWriter can write all sorts of data, including raw bytes and various primitive types, to a single Stream. When writing each of these, it needs a way of indicating where that particular section of data ends, so that BinaryWriter's reading counterpart, BinaryReader, knows where each section of data ends.
For some things, it's simple, because the data item itself is fixed size. A System.Int32 is 4 bytes, a System.Int64 is 8, and so on. They're fixed-length, so when reading you just read the expected number of bytes.
But string objects are variable length. There are two common approaches to handling this: prefixing the string with a count, or terminating the string with a null character ('\0'). In-memory .NET strings are counted rather than null-terminated, which means a string object can itself contain a null character, so null-terminating the string doesn't work for storing it in a file.
So the prefix with the count of bytes is used instead.
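A small sketch of why that prefix matters when reading mixed data back (the values and the in-memory stream are just for illustration, and the default UTF-8 behaviour of BinaryWriter/BinaryReader is assumed):

using System;
using System.IO;
using System.Text;

class RoundTrip
{
    static void Main()
    {
        using (var ms = new MemoryStream())
        {
            using (var writer = new BinaryWriter(ms, Encoding.UTF8, leaveOpen: true))
            {
                writer.Write(23861398);          // fixed size: 4 bytes
                writer.Write("variable length"); // length prefix + UTF-8 bytes
                writer.Write(3.14);              // fixed size: 8 bytes
            }

            ms.Position = 0;
            using (var reader = new BinaryReader(ms))
            {
                int number = reader.ReadInt32();
                string text = reader.ReadString(); // the prefix tells it where the string ends
                double value = reader.ReadDouble();
                Console.WriteLine($"{number} / {text} / {value}");
            }
        }
    }
}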
No, they're absolutely not interchangeable. BinaryWriter is an opinionated basic IO tool for writing to binary outputs; StreamWriter is for writing text outputs. Any time the output looks similar, that is purely incidental; they should absolutely not be used interchangeably.
For strings, they will look similar, but that's simply because both are ultimately just using a text-encoding. The boilerplate around them is very different though (with the BinaryWriter using length-prefix, etc).
If you're looking to use one as though it were the other: you're probably doing something very wrong. If you clarify what you're trying to do, we can probably offer guidance.
I am currently working with Unicode text as bytes, using the Encoding class to get bytes and get strings back.
However, I saw there is an Encoder class and it seems to do the same thing as the Encoding class. Does anyone know what the difference between them is and when to use each of them?
Here are the Microsoft documentation pages:
Encoder: https://msdn.microsoft.com/en-us/library/system.text.encoder(v=vs.110).aspx
Encoding: https://msdn.microsoft.com/en-us/library/system.text.encoding(v=vs.110).aspx
There is definitely a difference. An Encoding is an algorithm for transforming a sequence of characters into bytes and vice versa. An Encoder is a stateful object that transforms sequences of characters into bytes; to get an Encoder object you usually call GetEncoder on an Encoding object.

Why is it necessary to have a stateful transformation? Imagine you are trying to efficiently encode long sequences of characters. You want to avoid creating a lot of arrays, or one huge array, so you break the characters down into, say, reusable 1K character buffers. However, this might produce some illegal character sequences, for example a UTF-16 surrogate pair broken across two separate calls to GetBytes. The Encoder object knows how to handle this and saves the necessary state across successive calls to GetBytes.

Thus you use an Encoder for transforming a self-contained piece of text one block at a time. I believe you can reuse an Encoder instance for transforms of multiple sections of text as long as you have called GetBytes with flush equal to true on the last array of characters. If you just want to easily encode short strings, use the Encoding.GetBytes methods. For decoding there is a similar Decoder class that holds the decoding state.
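As a sketch of why the statefulness matters (the string and the split point are made up; the split deliberately lands between the two halves of a surrogate pair):

using System;
using System.Text;

class EncoderVsEncoding
{
    static void Main()
    {
        string text = "ab\U0001F600cd"; // contains a surrogate pair
        char[] chars = text.ToCharArray();

        // Stateless: fine, because the whole string is passed in one call.
        byte[] all = Encoding.UTF8.GetBytes(text);

        // Stateful: the Encoder holds the high surrogate between calls.
        Encoder encoder = Encoding.UTF8.GetEncoder();
        byte[] buffer = new byte[32];

        int split = 3; // splits the surrogate pair across two calls
        int n1 = encoder.GetBytes(chars, 0, split, buffer, 0, flush: false);
        int n2 = encoder.GetBytes(chars, split, chars.Length - split, buffer, n1, flush: true);

        Console.WriteLine(n1 + n2 == all.Length); // True: same bytes, produced in two chunks
    }
}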
Alright, so I basically want to read any file with a specific extension. Going through all the bytes and reading the file is easy, but what about getting the type of the next byte? For example:
while ((int)reader.BaseStream.Position != RecordSize * RecordsCount)
{
// How do I check what type is the next byte gonna be?
// Example:
// In every file, the first byte is always a uint:
uint id = reader.ReadUInt32();
// However, now I need to check for the next byte's type:
// How do I check the next byte's type?
}
Bytes don't have a type. When data of some language-level type, such as a char or string or long, is converted to bytes and written to a file, there is no strict way to tell what the type was: all bytes look alike, each just a number from 0-255.
In order to know, and to convert back from bytes to structured language types, you need to know the format that the file was written in.
For example, you might know that the file was written as an ASCII text file, and hence every byte represents one ASCII character.
Or you might know that your file was written with the format {uint}{50 byte string}{linefeed}, where the first 2 bytes represent a uint, the next 50 a string, followed by a linefeed.
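As a sketch, a reader for that hypothetical format might look like this (the file name and field names are invented; the 2-byte unsigned integer and 50-byte string come from the made-up layout above):

using System;
using System.IO;
using System.Text;

class RecordReader
{
    static void Main()
    {
        // Hypothetical layout: {2-byte uint}{50-byte ASCII string}{CR LF}
        using (var reader = new BinaryReader(File.OpenRead("records.dat"), Encoding.ASCII))
        {
            while (reader.BaseStream.Position < reader.BaseStream.Length)
            {
                ushort id = reader.ReadUInt16();  // the 2-byte unsigned integer
                string name = Encoding.ASCII.GetString(reader.ReadBytes(50)).TrimEnd();
                reader.ReadBytes(2);              // skip the CR LF
                Console.WriteLine($"{id}: {name}");
            }
        }
    }
}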
Because all bytes look the same, if you don't know the file format you can't read the file in a semantically correct way. For example, I might send you a file I created by writing out some ASCII text, but I might tell you that the file is full of 2-byte uints. You would write a program to read those bytes as 2-byte uints and it would work: any 2 bytes can be interpreted as a uint. I could tell someone else that the same file was composed of 4-byte longs, and they could read it as 4-byte longs: any 4 bytes can be interpreted as a long. I could tell someone else the file was a 2-byte uint followed by 6 ASCII characters. And so on.
Many types of files will have a defined format: for example, a Windows executable, or a Linux ELF binary.
You might be able to guess the types of the bytes in the file if you know something about the reason the file exists. But somehow you have to know, and then you interpret those bytes according to the file format description.
You might think "I'll write the bytes with a token describing them, so the reading program can know what each byte means." For example, a byte with a '1' might mean the next 2 bytes represent a uint, a byte with a '2' might mean the following byte tells the length of a string, and the bytes after that are the string, and so on. Sure, you can do that. But (a) the reading program still needs to understand that convention, so everything I said above is still true (it's turtles all the way down), (b) that approach uses a lot of space to describe the file, and (c) the reading program needs to know how to interpret a dynamically described file, which is only useful in certain circumstances and probably means there is a meta-meta-format describing what the embedded meta-format means.
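As a sketch of what such a self-describing convention could look like on the writing side (the tag values here are invented purely for illustration):

using System.IO;

class TaggedWriter
{
    // Invented tags: 1 = 2-byte unsigned integer, 2 = length-prefixed string.
    const byte UShortTag = 1;
    const byte StringTag = 2;

    static void WriteUShort(BinaryWriter w, ushort value)
    {
        w.Write(UShortTag); // the reader checks this tag before deciding how to read on
        w.Write(value);
    }

    static void WriteString(BinaryWriter w, string value)
    {
        w.Write(StringTag);
        w.Write(value);     // BinaryWriter adds its own length prefix
    }
}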
Long story short, all bytes look the same, and a reading program has to be told what those bytes represent before it can use them meaningfully.
All of the example code I have read online regarding SerialPort uses ReadByte and then converts to a character, instead of using ReadChar in the first place.
Is there an advantage to doing this?
The SerialPort.Encoding property is often misunderstood. The default is ASCIIEncoding, which produces '?' for byte values 0x80..0xFF, so people don't like getting those question marks. If you see code that converts the byte to char directly, it is getting it really wrong: Unicode has lots of unprintable code points in that byte range, and the odds that the device actually meant to send those characters are zero. A string tends to be regarded as easier to handle than a byte[], and it is.
When you use ReadChar it is based on the encoding you are using, like @Preston Guillot said. According to the documentation of ReadChar:
This method reads one complete character based on the encoding.
Use caution when using ReadByte and ReadChar together. Switching between reading bytes and reading characters can cause extra data to be read and/or other unintended behavior. If it is necessary to switch between reading text and reading binary data from the stream, select a protocol that carefully defines the boundary between text and binary data, such as manually reading bytes and decoding the data.
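As a sketch of setting the encoding explicitly so that ReadChar/ReadLine decode what the device actually sends (the port name, baud rate, and UTF-8 assumption are guesses about your device, not something from the question):

using System;
using System.IO.Ports;
using System.Text;

class PortSetup
{
    static void Main()
    {
        using (var port = new SerialPort("COM3", 9600))
        {
            // The default is ASCIIEncoding, which turns bytes 0x80..0xFF into '?'.
            port.Encoding = Encoding.UTF8;
            port.Open();

            string line = port.ReadLine(); // decoded using port.Encoding
            Console.WriteLine(line);
        }
    }
}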
I'm debugging some issues with writing pieces of an object to a file, and I've gotten down to the base case of just opening the file and writing "TEST" to it. I'm doing this with something like:
static FileStream fs;
static BinaryWriter w;
fs = new FileStream(filename, FileMode.Create);
w = new BinaryWriter(fs);
w.Write("test");
w.Close();
fs.Close();
Unfortunately, this ends up prepending a box to the front of the file and it looks like so:
TEST, with a fun box on the front. Why is this, and how can I avoid it?
Edit: It does not seem to be displaying the box here, but it's the unicode character that looks like gibberish.
They are not byte-order marks but a length-prefix, according to MSDN:
public virtual void Write(string value);
Writes a length-prefixed string to [the] stream
And you will need that length-prefix if you ever want to read the string back from that point. See BinaryReader.ReadString().
Additional
Since it seems you actually want a file-header check:
Is it a problem? You read the length prefix back, so as a type check on the file it works OK.
You can convert the string to a byte[] array, probably using Encoding.ASCII. But then you have to either use a fixed (implied) length or... prefix it yourself. After reading the byte[] you can convert it to a string again.
If you had a lot of text to write you could even attach a TextWriter to the same stream. But be careful, the Writers want to close their streams. I wouldn't advise this in general, but it is good to know. Here too you will have to mark a point where the other reader can take over (a fixed header works OK).
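As a sketch of the fixed-length header idea (the 4-byte "TEST" marker is just this example's convention, not part of the original code):

using System.IO;
using System.Text;

class HeaderCheck
{
    static void WriteHeader(Stream stream)
    {
        byte[] marker = Encoding.ASCII.GetBytes("TEST"); // always 4 bytes, no length prefix
        stream.Write(marker, 0, marker.Length);
    }

    static bool CheckHeader(Stream stream)
    {
        byte[] buffer = new byte[4];
        int read = stream.Read(buffer, 0, buffer.Length);
        return read == buffer.Length && Encoding.ASCII.GetString(buffer) == "TEST";
    }
}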
That's because a BinaryWriter writes the binary representation of the string, including the length of the string. If you were to write straight data (e.g. byte[], etc.) it wouldn't include that length.
// Convert the text to bytes first; writing a byte[] adds no length prefix.
byte[] text = System.Text.Encoding.Unicode.GetBytes("test");
FileStream fs = new FileStream("C:\\test.txt", FileMode.Create);
BinaryWriter writer = new BinaryWriter(fs);
writer.Write(text);   // writes only the bytes themselves
writer.Close();
You'll notice that it doesn't include the length. If you're going to be writing textual data using the binary writer, you'll need to convert it first.
The byte at the start is the length of the string; it's written out as a variable-length integer.
If the encoded string is 127 bytes or less, the length will be stored as one byte. When the encoded string reaches 128 bytes, the length is written out as 2 bytes, and it will move to 3 and 4 bytes at larger lengths as well.
The problem here is that you're using BinaryWriter, which writes out data that BinaryReader can read back in later. If you wish to write out in a custom format of your own, you must either drop writing strings like that, or drop using BinaryWriter altogether.
As Henk pointed out in this answer, this is the length of the string (written as a 7-bit encoded integer prefix).
If you don't want this, you can either write "TEST" manually by writing the ASCII characters for each letter as bytes, or you could use:
System.Text.Encoding.UTF8.GetBytes("TEST")
And write the resulting array (which will NOT contain a length int)
What you're seeing is actually a 7 bit encoded integer, which is a kind of integer compression.
The BinaryWriter prepends the text with this so that readers (i.e. BinaryReader) will know how long the written string is.
BinaryWriter.Write7BitEncodedInt
BinaryReader.Read7BitEncodedInt
You can read more about the implementation details of this at http://dpatrickcaldwell.blogspot.se/2011/09/7-bit-encoding-with-binarywriter-in-net.html.
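As a sketch, the packing scheme itself looks roughly like this (it mirrors the description above and the linked post, written out by hand rather than calling anything in the framework):

using System.Collections.Generic;

class SevenBit
{
    // Each output byte carries 7 bits of the value; the high bit means "more bytes follow".
    // Lengths up to 127 fit in one byte; 128 and above need two, and so on.
    static byte[] Encode(uint value)
    {
        var bytes = new List<byte>();
        while (value >= 0x80)
        {
            bytes.Add((byte)(value | 0x80));
            value >>= 7;
        }
        bytes.Add((byte)value);
        return bytes.ToArray();
    }
}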
You can save it as a UTF-8 encoded byte array like this:
...
BinaryWriter w = new BinaryWriter(fs);
w.Write(System.Text.Encoding.UTF8.GetBytes("test")); // raw bytes, no length prefix
...
That's a byte order mark, most likely. It's because the stream's encoding is set to Unicode.
Remember that .NET strings are internally encoded in UTF-16.
So, "test" is actually made of the bytes 0xff, 0xfe (together the byte order mark), 0x74, 0x00, 0x65, 0x00, 0x73, 0x00, 0x74, 0x00.
You probably want to work with bytes instead of streams of characters.
Sounds like byte order marks.
http://en.wikipedia.org/wiki/Byte-order_mark
Perhaps you want to write the string as UTF-8.
I need to be able to read a file format that mixes binary and non-binary data. Assuming I know the input is good, what's the best way to do this? As an example, let's take a file that has a double as the first line, a newline (0x0D 0x0A) and then ten bytes of binary data afterward. I could, of course, calculate the position of the newline, then make a BinaryReader and seek to that position, but I keep thinking that there has to be a better way.
You can use System.IO.BinaryReader. The problem with this though is you must know what type of data you are going to be reading before you call any of the Read methods.
Read(byte[], int, int)
Read(char[], int, int)
Read()
Read7BitEncodedInt()
ReadBoolean()
ReadByte()
ReadBytes(int)
ReadChar()
ReadChars(int)
ReadDecimal()
ReadDouble()
ReadInt16()
ReadInt32()
ReadInt64()
ReadSByte()
ReadSingle()
ReadString()
ReadUInt16()
ReadUInt32()
ReadUInt64()
And of course the same methods exist for writing in System.IO.BinaryWriter.
Is this file format already fixed? If it's not, it's a really good idea to change to use a length-prefixed format for the strings. Then you can read just the right amount and convert it to a string.
Otherwise, you'll need to read chunks from the file, scan for the newline, and decode the right amount of data or (if you don't find the newline) either buffer it somewhere else (e.g. a MemoryStream) or just remember the starting point and rewind the stream appropriately. It will be ugly, but that's just because of the deficiency of the file format.
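As a sketch of that approach for the example in the question (read bytes until the CR LF, decode just those as text, then read the ten binary bytes; the file name is made up and the textual part is assumed to be ASCII-compatible):

using System;
using System.Globalization;
using System.IO;
using System.Text;

class MixedReader
{
    static void Main()
    {
        using (var stream = File.OpenRead("mixed.dat")) // hypothetical file
        {
            // Collect bytes until the CR LF that ends the text line.
            var textBytes = new MemoryStream();
            int b, previous = -1;
            while ((b = stream.ReadByte()) != -1)
            {
                if (previous == 0x0D && b == 0x0A)
                    break;
                if (previous != -1)
                    textBytes.WriteByte((byte)previous);
                previous = b;
            }

            double value = double.Parse(
                Encoding.ASCII.GetString(textBytes.ToArray()),
                CultureInfo.InvariantCulture);

            // The stream is now positioned just past the newline: read the binary payload.
            byte[] payload = new byte[10];
            int read = stream.Read(payload, 0, payload.Length);

            Console.WriteLine($"{value}, {read} payload bytes");
        }
    }
}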
I would suggest you don't "over-decode" (i.e. decode the arbitrary binary data after the string) - while it may well not do any harm, in some encodings you could be reading an impossible sequence of binary data, which then starts getting into the realms of DecoderFallbacks and the like.
I've had to deal with that when reading HTTP requests coming in over the wire on Compact Framework. My solution was to roll my own non-buffering ASCII-only StreamReader, so that it was safe to interleave calls to both the StreamReader and the underlying Stream.