FileStream.Read() - bytes read - c#

FileStream.Read() returns the amount of bytes read, but... is there any situation other than having reached the end of file, that it will read less bytes than the number of bytes requested and not throw an exception?
the documentation says:
The Read method returns zero only after reaching the end of the stream. Otherwise, Read always reads at least one byte from the stream before returning. If no data is available from the stream upon a call to Read, the method will block until at least one byte of data can be returned. An implementation is free to return fewer bytes than requested even if the end of the stream has not been reached.
But this doesn't quite explain in what situations data would be unavailable and cause the method to block until it can read again. I mean, shouldn't most situations where data is unavailable force an exception?
What are real situations where comparing the number of bytes read against the number of expected bytes could differ (assuming that we're already checking for end of file when we mention number of bytes expected)?
EDIT: A bit more information, reason why I'm asking this is because I've come across a bit of code where the developer pretty much did something like this:
bytesExpected = (remainingBytesInFile > 94208 ? 94208 : remainingBytesInFile
while (bytesRead < bytesExpected)
{
bytesRead += fileStream.Read(buffer, bytesRead, bytesExpected - bytesRead)
}
Now, I can't see any advantage to having this while at all, I'd expect it to throw an exception if it can't read the number of bytes expected (bearing in mind it's already taking into account that there are those many bytes left to read)
What would the reason one could possibly have for something like this? I'm sure I'm missing something

The documentation is for Stream.Read, from which FileStream is derived. Since FileStream is a stream, it should obey the stream contract. Not all streams do, but unless you have a very good reason, you should stick to that.
In a typical file stream, you'll only get a return value smaller than count when you reach the end of file (and it's a pretty simple way of checking for the end of file).
However, in a NetworkStream, for example, you keep reading in a loop until the method returns zero - signalling the end of stream. The same works for file streams - you know you're at the end of the file when Read returns zero.
Most importantly, FileStream isn't just for what you'd consider files - it's also for pseudo-files like standard input/output pipes and COM ports, for example (try opening a file stream on PRN, for example). In that case, you're not reading a file with a fixed length, and the behaviour is the same as with NetworkStream.
Finally, don't forget that FileStream isn't sealed. It's perfectly fine for you to implement a virtualized file system, for example - and it's perfectly fine if your virtualized file system doesn't support seeking, or checking the length of file.
EDIT:
To address your edit, this is exactly how you're supposed to read any stream. Nothing wrong with it. If there's nothing else to read in a stream, the Read method will simply return 0, and you know the stream is over. The only thing is, it seems that he tries to fill his buffer to full, one buffer at a time - this only makes sense if you explicitly need to partition the file by 94208 bytes, and pass that byte[] for further processing somewhere.
If that's not the case, you don't really need to fill the full buffer - you just keep reading (and probably writing on some other side) until Read returns 0. And indeed, by default, FileStream will always fill the whole buffer unless it's built around a pipe handle - but since that's a possibility, you shouldn't rely on the "real file" behaviour, so as long as you need those byte[] for something non-stream (e.g. parsing messages), this is entirely fine. If you're only using the stream as an actual stream, and you're streaming the data somewhere else, it doesn't have a point, really - you only need one while to read the file.

Your expectations would only apply to the case when the stream is reading data off of a no-latency source. Other I/O sources can be slow, which is why the Read method might will not always be able to return immediately. That doesn't mean that there is an error (so no exception), just that it has to wait for data to arrive.
Examples: network stream, file stream on slow disk, etc.
(UPDATE, HDD example) To give an example specific to files (since your case is FileStream, although Read is defined on Stream and so all implementations should fulfill the requirements): mechanical hard-drives go to "sleep" when not active (specially on battery-powered devices, read laptops). Spinning up can take a second or so. That is not an IOException, but your read would have to wait for a second before any data is read.

Simple answer is that on a FileStream it probably never happens.
However keep in mind that the Read method is inherited from Stream which serves as base for many other streams like NetworkStream and in this case you may not be able to read has many bytes as you requested simple because they havent been received from the network yet.
So like the documentation says it all depends on the implementation of the specific type of stream - FileStream, NetworkStream, etc.

Related

Why write to Stream in chunks?

I am wondering why so many examples read byte arrays into streams in chucks and not all at once... I know this is a soft question, but I am interested.
I understand a bit about hardware and filling buffers can be very size dependent and you wouldn't want to write to the buffer again until it has been flushed to wherever it needs to go etc... but with the .Net platform (and other modern languages) I see examples of both. So when use which and when, or is the second an absolute no no?
Here is the thing (code) I mean:
var buffer = new byte[4096];
while (true)
{
var read = this.InputStream.Read(buffer, 0, buffer.Length);
if (read == 0)
break;
OutputStream.Write(buffer, 0, read);
}
rather than:
var buffer = new byte[InputStream.Length];
var read = this.InputStream.Read(buffer, 0, buffer.Length);
OutputStream.Write(buffer, 0, read);
I believe both are legal? So why go through all the fuss of the while loop (in whatever for you decide to structure it)?
I am playing devils advocate here as I want to learn as much as I can :)
In the first case, all you need is 4kB of memory. In the second case, you need as much memory as the input stream data takes. If the input stream is 4GB, you need 4GB.
Do you think it would be good if a file copy operation required 4GB of RAM? What if you were to prepare a disk image that's 20GB?
There is also this thing with pipes. You don't often use them on Windows, but a similar case is often seen on other operating systems. The second case waits for all data to be read, and only then writes them to the output. However, sometimes it is advisable to write data as soon as possible—the first case will start writing to the output stream as soon as the first 4kB of input is read. Think of serving web pages: it is advisable for a web server to send data as soon as possible, so that client's web browser will start rendering headers and first part of the content, not waiting for the whole body.
However, if you know that the input stream won't be bigger than 4kB, then both cases are equivalent.
Sometimes, InputStream.Length is not valid for some source, e.g from the net transport, or the buffer maybe huge, e.g read from a huge file. IMO.
It protects you from the situation where your input stream is several gigabytes long.
You have no idea how much data Read might return. This could create major performance problems if you're reading a very large file.
If you have control over the input, and are sure the size is reasonable, then you can certainly read the whole array in at once. But be especially careful if the user can supply an arbitrary input.

how do you account for when TCP does not get all the bytes in one read

I just read an article that says TCPClient.Read() may not get all the sent bytes in one read. How do you account for this?
For example, the server can write a string to the tcp stream. The client reads half of the string's bytes, and then reads the other half in another read call.
how do you know when you need to combine the byte arrays received in both calls?
how do you know when you need to combine the byte arrays received in both calls?
You need to decide this at the protocol level. There are four common models:
Close-on-finish: each side can only send a single "message" per connection. After sending the message, they close the sending side of the socket. The receiving side keeps reading until it reaches the end of the stream.
Length-prefixing: Before each message, include the number of bytes in the message. This could be in a fixed-length format (e.g. always 4 bytes) or some compressed format (e.g. 7 bits of size data per byte, top bit set for the final byte of size data). Then there's the message itself. The receiving code will read the size, then read that many bytes.
Chunking: Like length-prefixing, but in smaller chunks. Each chunk is length-prefixed, with a final chunk indicating "end of message"
End-of-message signal: Keep reading until you see the terminator for the message. This can be a pain if the message has to be able to include arbitrary data, as you'd need to include an escaping mechanism in order to represent the terminator data within the message.
Additionally, less commonly, there are protocols where each message is always a particular size - in which case you just need to keep going until you've read that much data.
In all of these cases, you basically need to loop, reading data into some sort of buffer until you've got enough of it, however you determine that. You should always use the return value of Read to note how many bytes you actually read, and always check whether it's 0, in which case you've reached the end of the stream.
Also note that this doesn't just affect network streams - for anything other than a local MemoryStream (which will always read as much data as you ask for in one go, if it's in the stream at all), you should assume that data may only become available over the course of multiple calls.
You should call read() in a loop. The condition of that loop would check if there is still any data available to be read.
That is kinda hard to answer, because you can never know when data will arrive, and thats why I usually use a thread for receiving data in my chat program. But you should be able to use something similar to this:
do{
numberOfBytesRead = myNetworkStream.Read(myReadBuffer,
0,
myReadBuffer.Length);
myCompleteMessage.AppendFormat("{0}",
Encoding.ASCII.GetString(myReadBuffer, 0, numberOfBytesRead));
}
while(myNetworkStream.DataAvailable);
Look at this source!

MemoryStream.WriteTo(Stream destinationStream) versus Stream.CopyTo(Stream destinationStream)

Which one is better : MemoryStream.WriteTo(Stream destinationStream) or Stream.CopyTo(Stream destinationStream)??
I am talking about the comparison of these two methods without Buffer as I am doing like this :
Stream str = File.Open("SomeFile.file");
MemoryStream mstr = new MemoryStream(File.ReadAllBytes("SomeFile.file"));
using(var Ms = File.Create("NewFile.file", 8 * 1024))
{
str.CopyTo(Ms) or mstr.WriteTo(Ms);// Which one will be better??
}
Update
Here is what I want to Do :
Open File [ Say "X" Type File]
Parse the Contents
From here I get a Bunch of new Streams [ 3 ~ 4 Files ]
Parse One Stream
Extract Thousands of files [ The Stream is an Image File ]
Save the Other Streams To Files
Editing all the Files
Generate a New "X" Type File.
I have written every bit of code which is actually working correctly..
But Now I am optimizing the code to make the most efficient.
It is an historical accident that there are two ways to do the same thing. MemoryStream always had the WriteTo() method, Stream didn't acquire the CopyTo() method until .NET 4.
The MemoryStream.WriteTo() version looks like this:
public virtual void WriteTo(Stream stream)
{
// Exception throwing code elided...
stream.Write(this._buffer, this._origin, this._length - this._origin);
}
The Stream.CopyTo() implementation like this:
private void InternalCopyTo(Stream destination, int bufferSize)
{
int num;
byte[] buffer = new byte[bufferSize];
while ((num = this.Read(buffer, 0, buffer.Length)) != 0)
{
destination.Write(buffer, 0, num);
}
}
Stream.CopyTo() is more universal, it works for any stream. And helps programmers that fumble copying data from, say, a NetworkStream. Forgetting to pay attention to the return value from Read() was a very common bug. But it of course copies the bytes twice and allocates that temporary buffer, MemoryStream doesn't need it since it can write directly from its own buffer. So you'd still prefer WriteTo(). Noticing the difference isn't very likely.
MemoryStream.WriteTo: Writes the entire contents of this memory stream to another stream.
Stream.CopyTo: Reads the bytes from the current stream and writes them to the destination stream. Copying begins at the current position in the current stream.
You'll need to seek back to 0, to get the whole source stream copied.
So I think MemoryStream.WriteTo better option for this situation
If you use Stream.CopyTo, you don't need to read all the bytes into memory to start with. However:
This code would be simpler if you just used File.Copy
If you are going to load all the data into memory, you can just use:
byte[] data = File.ReadAllBytes("input");
File.WriteAllBytes("output", data);
You should have a using statement for the input as well as the output stream
If you really need processing so can't use File.Copy, using Stream.CopyTo will cope with larger files than loading everything into memory. You may not need that, of course, or you may need to load the whole file into memory for other reasons.
If you have got a MemoryStream, I'd probably use MemoryStream.WriteTo rather than Stream.CopyTo, but it probably won't make much difference which you use, except that you need to make sure you're at the start of the stream when using CopyTo.
I think Hans Passant's claim of a bug in MemoryStream.WriteTo() is wrong; it does not "ignore the return value of Write()". Stream.Write() returns void, which implies to me that the entire count bytes are written, which implies that Stream.Write() will block as necessary to complete the operation to, e.g., a NetworkStream, or throw if it ultimately fails.
That is indeed different from the write() system call in ?nix, and its many emulations in libc and so forth, which can return a "short write". I suspect Hans leaped to the conclusion that Stream.Write() followed that, which I would have expected, too, but apparently it does not.
It is conceivable that Stream.Write() could perform a "short write", without returning any indication of that, requiring the caller to check that the Position property of the Stream has actually been advanced by count. That would be a very error-prone API, and I doubt that it does that, but I have not thoroughly tested it. (Testing it would be a bit tricky: I think you would need to hook up a TCP NetworkStream with a reader on the other end that blocked forever, and write enough to fill up the wire buffers. Or something like that...)
The comments for Stream.Write() are not quite unambiguous:
Summary:
When overridden in a derived class, writes a sequence of bytes to the current
stream and advances the current position within this stream by the number
of bytes written.
Parameters: buffer:
An array of bytes. This method copies count bytes from buffer to the current stream.
Compare that to the Linux man page for write(2):
write() writes up to count bytes from the buffer pointed buf to the file referred to by the file descriptor fd.
Note the crucial "up to". That sentence is followed by explanation of some of the conditions under which a "short write" might occur, making it very explicit that it can occur.
This is really a critical issue: we need to know how Stream.Write() behaves, beyond all doubt.
The CopyTo method creates a buffer, populates its with data from the original stream and then calls the Write method passing the created buffer as a parameter. The WriteTo uses the memoryStream's internal buffer to write. That is the difference. What is better - it is up to you to decide which method you prefer.
Creating a MemoryStream from a HttpInputStream in Vb.Net:
Dim filename As String = MyFile.PostedFile.FileName
Dim fileData As Byte() = Nothing
Using binaryReader = New BinaryReader(MyFile.PostedFile.InputStream)
binaryReader.BaseStream.Position = 0
fileData = binaryReader.ReadBytes(MyFile.PostedFile.ContentLength)
End Using
Dim memoryStream As MemoryStream = New MemoryStream(fileData)

stream reliability

I came across a few posts that where claming that streams are not a reliable data structure, meaning that read/write operations might not follow through in all cases.
So:
a) Is there any truth to this consensus?
b) If so what are the cases in wich read/write operaitions might fail?
This consensus on streams which I came across claims that you sould loop through read/write operations until complete:
var bytesRead = 0;
var _packet = new byte[8192];
while ((bytesRead += file_reader.Read(_packet, bytesRead, _packet.Length - bytesRead)) < _packet.Length) ;
Well, it depends on what operation you're talking about, and on what layer you consider it a failure.
For instance, if you attempt to read past the end of a stream (ie. read 1000 bytes from a file that only contains 100 bytes, or read a 1000 bytes from a position that is closer to the end of the file than 1000), you will get fewer bytes left. The stream read methods returns the number of bytes they actually managed to read, so you should check that value.
As for write operations, writing to a file might fail if the disk is full, or other similar problems, but in case of write operations you'll get back an exception.
If you're writing to sockets or other network streams, there is no guarantee that even if the Write method returns without exceptions, that the other end is able to receive it, there's a ton of problems that can go wrong along the way.
However, to alleviate your concerns, streams by themselves are not unreliable.
The medium they talk to, however, can be.
If you follow the documentation, catch and act accordingly when errors occur, I think you'll find streams are pretty much bullet-proof. There's a significant amount of code that has invested in this reliability.
Can we see links to those who claim otherwise? Either you've misunderstood, or they are incorrect.

precautions for reading from a memorystream in c#

I recently came across this web page http://www.yoda.arachsys.com/csharp/readbinary.html explaining what precautions to take when reading from a filestream. The gist of it is that the following code doesnt always work:
// Bad code! Do not use!
FileStream fs = File.OpenRead(filename);
byte[] data = new byte[fs.Length];
fs.Read (data, 0, data.Length);
This is dangerous as the third argument for Read is a maximum of bytes to be read, and you should use Read's return value to check how much actually got read.
My question is should you take the same precautions when reading from a memorystream and under which circumstances might Read return before all bytes are read?
Well, I believe the current implementation of MemoryStream will always fill the buffer if it can - unless you've got some evil class derived from it. It's not guaranteed though, as far as I can see. The documentation even contains the warning:
An implementation is free to return fewer bytes than requested even if the end of the stream has not been reached.
Personally, I'd always code this defensively unless it makes things much easier. You never know when someone will change the type of stream and not notice what's happened.
Normally with a MemoryStream though, I want all the bytes at once: so I call MemoryStream.ToArray. This is guaranteed to work, and if someone changes the code to not use a MemoryStream, it will fail to compile as that member's only on MemoryStream. For general streams, I use a utility method which reads fully from a stream and returns a byte array.
I cant think of any reason for a normal MemoryStream. Unmanaged might be a different story.
Anyways, the GetBuffer() ToArray() command is always handy. :)
Yes, you should always be aware of how many bytes were actually read from a stream when calling Read. The roout cause can vary depending on the stream type, but essentially the return value will be less than the actual buffer size whenever you are trying to read beyond the end of the stream.
Here's what MSDN says about it:
...can be less than the number of
bytes requested if that number of
bytes are not currently available, or
zero if the end of the stream is
reached before any bytes are read.
and
An implementation is free to return
fewer bytes than requested even if the
end of the stream has not been
reached.
Note the term "an implementation...".

Categories