This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Best way to copy between two Stream instances - C#
I know I can use File.Copy but, I'm more interested in doing it a longer way round, just for educational purposes.
Now, the approach I want to discuss is to use a StreamReader and StreamWriter (or FileStreams).
In my mind, I would read the file (as binary) into memory and then write the file to the new location. This strikes me as potential for errors since
1) the entire file is being loaded into memory (and I wouldn't know how big the file is) and
2) it would be time consuming compared to something like copy a byte, paste a byte (which I assume is how streaming works) since we'd have to wait for the entire file to be stored in memory before pasting even begins.
So, a long way to ask, how would I stream a copy and paste job?
So, a long way to ask, how would I stream a copy and paste job?
Instead of reading either each byte separately into memory, or the whole file into memory, you'd go half-way between - create a buffer, e.g. 32K, and ask the input stream to read that much data. Write out however many bytes you've read, and then repeat until there's no more data to read. So something like:
public void Copy(Stream input, Stream output)
{
byte[] buffer = new byte[32 * 1024];
int bytesRead;
while ((bytesRead = input.Read(buffer, 0, buffer.Length)) != 0)
{
output.Write(buffer, 0, bytesRead);
}
}
This isn't as efficient as it could potentially be, as you're not reading and writing at the same time. Asynchronous IO would fix that, but make it much more complicated - and it's entirely possible that the OS will be smart enough to keep reading while you're writing anyway...
Note that you would definitely want to use Stream rather than TextReader / TextWriter unless you were actually reading text.
You can read/write the file in blocks of 1 MB (for example). MSDN has an example which shows how to do that.
Create an empty destination file
Read a buffer from the source file (lets say 1024 bytes)
Write the buffer to the destionation file
Repeat with 2 until everything is done
Related
This question already has answers here:
C# - How do I read and write a binary file?
(4 answers)
Closed 9 years ago.
The application I'm attempting to create would read the binary code of any file and create a file with the exact same binary code, creating a copy.
While writing a program that reads a file and writes it somewhere else, I was running into encoding issues, so I hypothesize that reading as straight binary will overcome this.
The file being read into the application is important, as after I get this to work I will add additional functionality to search within or manipulate the file's data as it is read.
Update:
I'd like to thank everyone that took the time to answer, I now have a working solution. Wolfwyrd's answer was exactly what I needed.
BinaryReader will handle reading the file into a byte buffer. BinaryWriter will handle dumping those bytes back out to another file. Your code will be something like:
using (var binReader = new System.IO.BinaryReader(System.IO.File.OpenRead("PATHIN")))
using (var binWriter = new System.IO.BinaryWriter(System.IO.File.OpenWrite("PATHOUT")))
{
byte[] buffer = new byte[512];
while (binReader.Read(buffer, 0, 512) != 0)
{
binWriter.Write(buffer);
}
}
Here we cycle a buffer of 512 bytes and immediately write it out to the other file. You would need to choose sensible sizes for your own buffer (there's nothing stopping you reading the entire file if it's reasonably sized). As you mentioned doing pattern matching you will need to consider the case where a pattern overlaps a buffered read if you do not load the whole file into a single byte array.
This SO Question has more details on best practices on reading large files.
Look at MemoryStream and BinaryReader/BinaryWriter:
http://www.dotnetperls.com/memorystream
http://msdn.microsoft.com/en-us/library/system.io.binaryreader.aspx
http://msdn.microsoft.com/en-us/library/system.io.binarywriter.aspx
Have a look at using BinaryReader Class
Reads primitive data types as binary values in a specific encoding.
and maybe BinaryReader.ReadBytes Method
Reads the specified number of bytes from the current stream into a
byte array and advances the current position by that number of bytes.
also BinaryWriter Class
Writes primitive types in binary to a stream and supports writing
strings in a specific encoding.
Another good example C# - Copying Binary Files
for instance, one char at a time.
using (BinaryReader writer = new BinaryWrite(File.OpenWrite("target"))
{
using (BinaryReader reader = new BinaryReader(File.OpenRead("source"))
{
var nextChar = reader.Read();
while (nextChar != -1)
{
writer.Write(Convert.ToChar(nextChar));
nextChar = reader.Read();
}
}
}
The application I'm attempting to create would read the binary code of any file and create a file with the exact same binary code, creating a copy.
Is this for academic purposes? Or do you actually just want to copy a file?
If the latter, you'll want to just use the System.IO.File.Copy method.
I am wondering why so many examples read byte arrays into streams in chucks and not all at once... I know this is a soft question, but I am interested.
I understand a bit about hardware and filling buffers can be very size dependent and you wouldn't want to write to the buffer again until it has been flushed to wherever it needs to go etc... but with the .Net platform (and other modern languages) I see examples of both. So when use which and when, or is the second an absolute no no?
Here is the thing (code) I mean:
var buffer = new byte[4096];
while (true)
{
var read = this.InputStream.Read(buffer, 0, buffer.Length);
if (read == 0)
break;
OutputStream.Write(buffer, 0, read);
}
rather than:
var buffer = new byte[InputStream.Length];
var read = this.InputStream.Read(buffer, 0, buffer.Length);
OutputStream.Write(buffer, 0, read);
I believe both are legal? So why go through all the fuss of the while loop (in whatever for you decide to structure it)?
I am playing devils advocate here as I want to learn as much as I can :)
In the first case, all you need is 4kB of memory. In the second case, you need as much memory as the input stream data takes. If the input stream is 4GB, you need 4GB.
Do you think it would be good if a file copy operation required 4GB of RAM? What if you were to prepare a disk image that's 20GB?
There is also this thing with pipes. You don't often use them on Windows, but a similar case is often seen on other operating systems. The second case waits for all data to be read, and only then writes them to the output. However, sometimes it is advisable to write data as soon as possible—the first case will start writing to the output stream as soon as the first 4kB of input is read. Think of serving web pages: it is advisable for a web server to send data as soon as possible, so that client's web browser will start rendering headers and first part of the content, not waiting for the whole body.
However, if you know that the input stream won't be bigger than 4kB, then both cases are equivalent.
Sometimes, InputStream.Length is not valid for some source, e.g from the net transport, or the buffer maybe huge, e.g read from a huge file. IMO.
It protects you from the situation where your input stream is several gigabytes long.
You have no idea how much data Read might return. This could create major performance problems if you're reading a very large file.
If you have control over the input, and are sure the size is reasonable, then you can certainly read the whole array in at once. But be especially careful if the user can supply an arbitrary input.
I'm new to programming in general (My understanding of programming concepts is still growing.). So this question is about learning, so please provide enough info for me to learn but not so much that I can't, thank you.
(I would also like input on how to make the code reusable with in the project.)
The goal of the project I'm working on consists of:
Read binary file.
I have known offsets I need to read to find a particular chunk of data from within this file.
First offset is first 4 bytes(Offset for end of my chunk).
Second offset is 16 bytes from end of file. I read for 4 bytes.(Gives size of chunk in hex).
Third offset is the 4 bytes following previous, read for 4 bytes(Offset for start of chunk in hex).
Locate parts in the chunk to modify by searching ASCII text as well as offsets.
Now I have the start offset, end offset and size of my chunk.
This should allow me to read bytes from file into a byte array and know the size of the array ahead of time.
(Questions: 1. Is knowing the size important? Other than verification. 2. Is reading part of a file into a byte array in order to change bytes and overwrite that part of the file the best method?)
So far I have managed to read the offsets from the file using BinaryReader on a MemoryStream. I then locate the chunk of data I need and read that into a byte array.
I'm stuck in several ways:
What are the best practices for binary Reading / Writing?
What's the best storage convention for the data that is read?
When I need to modify bytes how do I go about that.
Should I be using FileStream?
Since you want to both read and write, it makes sense to use the FileStream class directly (using FileMode.Open and FileAccess.ReadWrite). See FileStream on MSDN for a good overall example.
You do need to know the number of bytes that you are going to be reading from the stream. See the FileStream.Read documentation.
Fundamentally, you have to read the bytes into memory at some point if you're going to use and later modify their contents. So you will have to make an in-memory copy (using the Read method is the right way to go if you're reading a variable-length chunk at a time).
As for best practices, always dispose your streams when you're done; e.g.:
using (var stream = File.Open(FILE_NAME, FileMode.Open, FileAccess.ReadWrite))
{
//Do work with the FileStream here.
}
If you're going to do a large amount of work, you should be doing the work asynchronously. (Let us know if that's the case.)
And, of course, check the FileStream.Read documentation and also the FileStream.Write documentation before using those methods.
Reading bytes is best done by pre-allocating an in-memory array of bytes with the length that you're going to read, then reading those bytes. The following will read the chunk of bytes that you're interested in, let you do work on it, and then replace the original contents (assuming the length of the chunk hasn't changed):
EDIT: I've added a helper method to do work on the chunk, per the comments on variable scope.
using (var stream = File.Open(FILE_NAME, FileMode.Open, FileAccess.ReadWrite))
{
var chunk = new byte[numOfBytesInChunk];
var offsetOfChunkInFile = stream.Position; // It sounds like you've already calculated this.
stream.Read(chunk, 0, numOfBytesInChunk);
DoWorkOnChunk(ref chunk);
stream.Seek(offsetOfChunkInFile, SeekOrigin.Begin);
stream.Write(chunk, 0, numOfBytesInChunk);
}
private void DoWorkOnChunk(ref byte[] chunk)
{
//TODO: Any mutation done here to the data in 'chunk' will be written out to the stream.
}
Which one is better : MemoryStream.WriteTo(Stream destinationStream) or Stream.CopyTo(Stream destinationStream)??
I am talking about the comparison of these two methods without Buffer as I am doing like this :
Stream str = File.Open("SomeFile.file");
MemoryStream mstr = new MemoryStream(File.ReadAllBytes("SomeFile.file"));
using(var Ms = File.Create("NewFile.file", 8 * 1024))
{
str.CopyTo(Ms) or mstr.WriteTo(Ms);// Which one will be better??
}
Update
Here is what I want to Do :
Open File [ Say "X" Type File]
Parse the Contents
From here I get a Bunch of new Streams [ 3 ~ 4 Files ]
Parse One Stream
Extract Thousands of files [ The Stream is an Image File ]
Save the Other Streams To Files
Editing all the Files
Generate a New "X" Type File.
I have written every bit of code which is actually working correctly..
But Now I am optimizing the code to make the most efficient.
It is an historical accident that there are two ways to do the same thing. MemoryStream always had the WriteTo() method, Stream didn't acquire the CopyTo() method until .NET 4.
The MemoryStream.WriteTo() version looks like this:
public virtual void WriteTo(Stream stream)
{
// Exception throwing code elided...
stream.Write(this._buffer, this._origin, this._length - this._origin);
}
The Stream.CopyTo() implementation like this:
private void InternalCopyTo(Stream destination, int bufferSize)
{
int num;
byte[] buffer = new byte[bufferSize];
while ((num = this.Read(buffer, 0, buffer.Length)) != 0)
{
destination.Write(buffer, 0, num);
}
}
Stream.CopyTo() is more universal, it works for any stream. And helps programmers that fumble copying data from, say, a NetworkStream. Forgetting to pay attention to the return value from Read() was a very common bug. But it of course copies the bytes twice and allocates that temporary buffer, MemoryStream doesn't need it since it can write directly from its own buffer. So you'd still prefer WriteTo(). Noticing the difference isn't very likely.
MemoryStream.WriteTo: Writes the entire contents of this memory stream to another stream.
Stream.CopyTo: Reads the bytes from the current stream and writes them to the destination stream. Copying begins at the current position in the current stream.
You'll need to seek back to 0, to get the whole source stream copied.
So I think MemoryStream.WriteTo better option for this situation
If you use Stream.CopyTo, you don't need to read all the bytes into memory to start with. However:
This code would be simpler if you just used File.Copy
If you are going to load all the data into memory, you can just use:
byte[] data = File.ReadAllBytes("input");
File.WriteAllBytes("output", data);
You should have a using statement for the input as well as the output stream
If you really need processing so can't use File.Copy, using Stream.CopyTo will cope with larger files than loading everything into memory. You may not need that, of course, or you may need to load the whole file into memory for other reasons.
If you have got a MemoryStream, I'd probably use MemoryStream.WriteTo rather than Stream.CopyTo, but it probably won't make much difference which you use, except that you need to make sure you're at the start of the stream when using CopyTo.
I think Hans Passant's claim of a bug in MemoryStream.WriteTo() is wrong; it does not "ignore the return value of Write()". Stream.Write() returns void, which implies to me that the entire count bytes are written, which implies that Stream.Write() will block as necessary to complete the operation to, e.g., a NetworkStream, or throw if it ultimately fails.
That is indeed different from the write() system call in ?nix, and its many emulations in libc and so forth, which can return a "short write". I suspect Hans leaped to the conclusion that Stream.Write() followed that, which I would have expected, too, but apparently it does not.
It is conceivable that Stream.Write() could perform a "short write", without returning any indication of that, requiring the caller to check that the Position property of the Stream has actually been advanced by count. That would be a very error-prone API, and I doubt that it does that, but I have not thoroughly tested it. (Testing it would be a bit tricky: I think you would need to hook up a TCP NetworkStream with a reader on the other end that blocked forever, and write enough to fill up the wire buffers. Or something like that...)
The comments for Stream.Write() are not quite unambiguous:
Summary:
When overridden in a derived class, writes a sequence of bytes to the current
stream and advances the current position within this stream by the number
of bytes written.
Parameters: buffer:
An array of bytes. This method copies count bytes from buffer to the current stream.
Compare that to the Linux man page for write(2):
write() writes up to count bytes from the buffer pointed buf to the file referred to by the file descriptor fd.
Note the crucial "up to". That sentence is followed by explanation of some of the conditions under which a "short write" might occur, making it very explicit that it can occur.
This is really a critical issue: we need to know how Stream.Write() behaves, beyond all doubt.
The CopyTo method creates a buffer, populates its with data from the original stream and then calls the Write method passing the created buffer as a parameter. The WriteTo uses the memoryStream's internal buffer to write. That is the difference. What is better - it is up to you to decide which method you prefer.
Creating a MemoryStream from a HttpInputStream in Vb.Net:
Dim filename As String = MyFile.PostedFile.FileName
Dim fileData As Byte() = Nothing
Using binaryReader = New BinaryReader(MyFile.PostedFile.InputStream)
binaryReader.BaseStream.Position = 0
fileData = binaryReader.ReadBytes(MyFile.PostedFile.ContentLength)
End Using
Dim memoryStream As MemoryStream = New MemoryStream(fileData)
Motivated by this answer I was wondering what's going on under the curtain if one uses lots of FileStream.Seek(-1).
For clarity I'll repost the answer:
using (var fs = File.OpenRead(filePath))
{
fs.Seek(0, SeekOrigin.End);
int newLines = 0;
while (newLines < 3)
{
fs.Seek(-1, SeekOrigin.Current);
newLines += fs.ReadByte() == 13 ? 1 : 0; // look for \r
fs.Seek(-1, SeekOrigin.Current);
}
byte[] data = new byte[fs.Length - fs.Position];
fs.Read(data, 0, data.Length);
}
Personally I would have read like 2048 bytes into a buffer and searched that buffer for the char.
Using Reflector I found out that internally the method is using SetFilePointer.
Is there any documentation about windows caching and reading a file backwards? Does Windows buffer "backwards" and consult the buffer when using consecutive Seek(-1) or will it read ahead starting from the current position?
It's interesting that on the one hand most people agree with Windows doing good caching, but on the other hand every answer to "reading file backwards" involves reading chunks of bytes and operating on that chunk.
Going forward vs backward doesn't usually make much difference. The file data is read into the file system cache after the first read, you get a memory-to-memory copy on ReadByte(). That copy isn't sensitive to the file pointer value as long as the data is in the cache. The caching algorithm does however work from the assumption that you'd normally read sequentially. It tries to read ahead, as long as the file sectors are still on the same track. They usually are, unless the disk is heavily fragmented.
But yes, it is inefficient. You'll get hit with two pinvoke and API calls for each individual byte. There's a fair amount of overhead in that, those same two calls could also read, say, 65 kilobytes with the same amount of overhead. As usual, fix this only when you find it to be a perf bottleneck.
Here is a pointer on File Caching in Windows
The behavior may also depends on where physically resides the file (hard disk, network, etc.) as well as local configuration/optimization.
An also important source of information is the CreateFile API documentation: CreateFile Function
There is a good section named "Caching Behavior" that tells us at least how you can influence file caching, at least in the unmanaged world.