Getting number of disk accesses from BufferedStream - c#

I'm reading a binary file using BinaryReader. I want to count the number of disk accesses when buffering input with BufferedStream. Unfortunately this class is sealed, so I can't override a method to count accesses manually.
Is there any way of doing it using standard library? Or must I write my own buffering BinaryReader to achieve this?

You could just calculate it from the buffer size you specified in the BufferedStream(Stream, int) constructor. The default is 4096 bytes. Assuming you don't Seek(), the number of file accesses is (filesize + bufsize - 1) / bufsize.
A total overkill approach is to keep in mind that you can chain streams. Create your own Stream derived class and just count the number of calls to the Read() method that need to supply data from the underlying stream. Pass an instance of that class to the BufferedStream constructor.
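If you go the chaining route, a minimal sketch of such a counting wrapper might look like the following; the class name and members are mine, not a framework type.

using System;
using System.IO;

// Hypothetical pass-through stream: counts how many times the BufferedStream
// has to come back to the underlying file for more data.
class CountingStream : Stream
{
    private readonly Stream _inner;
    public int ReadCalls { get; private set; }

    public CountingStream(Stream inner) { _inner = inner; }

    public override int Read(byte[] buffer, int offset, int count)
    {
        ReadCalls++;                                  // one count per buffer refill
        return _inner.Read(buffer, offset, count);
    }

    // Minimal plumbing required by the abstract Stream class
    public override bool CanRead  { get { return _inner.CanRead; } }
    public override bool CanSeek  { get { return _inner.CanSeek; } }
    public override bool CanWrite { get { return false; } }
    public override long Length   { get { return _inner.Length; } }
    public override long Position
    {
        get { return _inner.Position; }
        set { _inner.Position = value; }
    }
    public override void Flush() { _inner.Flush(); }
    public override long Seek(long offset, SeekOrigin origin) { return _inner.Seek(offset, origin); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }

    protected override void Dispose(bool disposing)
    {
        if (disposing) _inner.Dispose();
        base.Dispose(disposing);
    }
}

You would then read through new BinaryReader(new BufferedStream(new CountingStream(File.OpenRead(path)), 4096)) and inspect ReadCalls when you are done.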
Neither approach lets you find out how often the operating system hits the disk driver and physically transfers data from the disk. The file system cache sits in between, and the actual number greatly depends on how the file data is mapped across the disk cylinders and sectors. You'd get info about that from a performance counter. There's little point in actually using it; the numbers you get will reproduce very poorly on another machine.


How FileStream works in c#?

I have the following piece of code. I am not fully understanding its implementation.
img stores the path of the image, as in c:\\desktop\my.jpg
FileStream fs = new FileStream(img, FileMode.Open, FileAccess.Read);
byte[] bimage = new byte[fs.Length];
fs.Read(bimage, 0, Convert.ToInt32(fs.Length));
In the first line, the FileStream is opening the image located at path img for reading.
The second line is (I guess) converting the opened file to bytes.
What does fs.Length represent?
Does an image have a length, or is it the length of the file name (I guess not)?
What is the third line doing?
Please help me clarify!
fs is one of many C# I/O objects; it wraps a file handle and exposes methods like Read in your example. Read fills a byte array that you pass to it, so you declare the array first and size it to the file length (the second line; fs.Length is the file length in bytes). Then all you need to do is read the file content and store it in that array (third line). This can be done in one call (as in the example) or by reading blocks in a loop. When you are done reading, it is good practice to dispose the fs object (for example with a using statement) so the file handle is released.
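As a side note, a single Read call is not guaranteed to fill the whole array. A minimal sketch of a loop that keeps reading until the array is full, reusing the img path from the question, might look like this:

using System.IO;

// Sketch: Stream.Read may return fewer bytes than requested, so loop until
// the array is full or the file ends.
string img = @"c:\desktop\my.jpg";
using (FileStream fs = new FileStream(img, FileMode.Open, FileAccess.Read))
{
    byte[] bimage = new byte[fs.Length];
    int total = 0;
    while (total < bimage.Length)
    {
        int read = fs.Read(bimage, total, bimage.Length - total);
        if (read == 0)
            break;                 // reached end of file earlier than expected
        total += read;
    }
    // bimage now holds the file's contents
}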
A "stream", in computing, is commonly a control buffer you open (or connect), read a chunk and close (or disconnect).
In case of files, when opening OS finds file, handle pointers and locks on the resource.
You do reads. When you read, you are picking a range of bytes ("chunk") and putting it in memory. In this case, that second line byte array.
You could, in thesis, pick any number. But life is hard: you have physical memory limitation in any computer.
If your file fits in your RAM + virtual memory... you may use a large byte array (FSB and motherboard throughput applies).
So, in a low memory system, like a Raspberry Pi B (512MB), this can cause errors or failures.
There is where goes fs.Length. Microsoft implemented it to count all the bytes in the file. It iterates, counting every byte till the end of file (EOF).
Knowing this, a fs.Length call will be faster in smaller files, slower on bigger ones and allows you to do some math for your byte array maximum size versus optimum size (hardware power versus file chunk size).
Your buffers shall consider max computer memory and processes running (and using memory) in parallel.
You SHOULD NEVER, in any plataform, rely only on file size to define your memory buffer size.
And remember to always close/disconnect/release/dispose locked I/O resources... Like TCP connections, files, consoles, database connections and thread-safety locks.
Imagine you read a file with 10 GB, as a payment transaction log file in a 512 MB + 2GB SD Raspberry Pi.
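Along those lines, here is a minimal sketch that processes the file with a small fixed buffer, so memory use stays bounded regardless of the file size; the 64 KB figure is just an illustrative choice.

using System.IO;

// Sketch: process a file of any size with a small, reusable buffer
// instead of allocating a buffer of fs.Length bytes.
using (FileStream fs = new FileStream(@"c:\desktop\my.jpg", FileMode.Open, FileAccess.Read))
{
    byte[] buffer = new byte[64 * 1024];   // 64 KB chunks, independent of fs.Length
    int read;
    while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
    {
        // process buffer[0 .. read) here
    }
}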

Add data to a zip archive until it reaches a given size

I'm trying to zip some data, but also break up the data set into multiple archives so that no single zip file ends up being larger than some maximum.
Since my data is not sourced from the file system, it seems a good idea to use a streaming approach. I thought I could simply write one atomic piece of data at a time while keeping track of the stream position prior to writing each piece. Once I exceed the limit, I truncate the stream to the position before the piece that can't fit, and move on to create the next archive.
I've attempted this with the classes in System.IO.Compression: create an archive, create an entry, use ZipArchiveEntry.Open to get a stream, and write to that stream. The problem is that it seems impossible to find out from this how large the archive is at any given point.
I can read the stream's position, but this is tracking uncompressed bytes. Truncating the stream works fine too, so I have this working as intended now with the important exception that the limit applies to how much uncompressed data there is per archive rather than how large the compressed archive becomes.
The data is part compressible text and various blobs (attachments from end users) that will sometimes be very compressible and sometimes not at all.
My questions:
1) Is there something about the deflate algorithm that inherently conflicts with my approach? I know it is a block-based compression scheme and I imagine the algorithm may not decide how to encode the compressed data until the entire archive has been specified.
2) If the answer to (1) above is "yes", what is a good strategy that doesn't introduce far too much overhead?
One idea I have is to assume that compressed data won't be larger than uncompressed data. I can then write to the stream until uncompressed data exceeds the threshold, and then save the archive, calculate the difference between the threshold and the current size, and repeat until full.
In case that wasn't clear, say the limit is 1MB. I write 1 MB of uncompressed data and save the archive. I then see that the resulting archive is 0.3MB. I open the archive (and its only entry) again and start over with a new limit of 0.7 MB, since I know I am able to add at least that much uncompressed data to it without overshooting. I imagine this approach is relatively simple to implement, and will test it, but am interested to hear if anyone has better ideas.
You can find out a lower bound on how big the compressed data is by looking at the Length or Position of the underlying FileStream. You can then decide to stop adding entries. The ZIP stream classes tend not to buffer very much, probably on the order of 64 KB.
Truncating the archive at a certain point should be possible. Try flushing the ZIP stream before measuring the Position of the base stream. This is always possible in theory but the actual library you are using might not support it. Test it or look at the source.
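As a rough sketch of that idea (the size limit, the entry names and the dummy pieces are illustrative assumptions, and your library's buffering behaviour may differ):

using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

// Sketch: roll over to a new archive once the compressed bytes already
// flushed to the base FileStream exceed a limit.
long maxArchiveSize = 1024 * 1024;            // 1 MB per archive
var pieces = new List<byte[]>();              // your data pieces; contents below are dummies
for (int i = 0; i < 100; i++)
    pieces.Add(new byte[100 * 1024]);         // 100 KB placeholder pieces to exercise the sketch

int archiveIndex = 0;
int pieceIndex = 0;
while (pieceIndex < pieces.Count)
{
    string path = $"archive-{archiveIndex++}.zip";
    using (var fileStream = new FileStream(path, FileMode.Create, FileAccess.Write))
    using (var archive = new ZipArchive(fileStream, ZipArchiveMode.Create, leaveOpen: true))
    {
        while (pieceIndex < pieces.Count && fileStream.Position < maxArchiveSize)
        {
            ZipArchiveEntry entry = archive.CreateEntry($"piece-{pieceIndex}.bin");
            using (Stream entryStream = entry.Open())
            {
                byte[] piece = pieces[pieceIndex++];
                entryStream.Write(piece, 0, piece.Length);
            }
            // Once the entry stream is closed, its compressed bytes have been
            // written to fileStream, so Position is a lower bound on the archive
            // size (the central directory is only added when the archive is disposed).
        }
    }
}

Note that because the check happens before each piece is written, an archive can still overshoot the limit by roughly one compressed piece plus the central directory, so leave some headroom or truncate as you described.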

Combining FileStream and MemoryStream to avoid disk accesses/paging while receiving gigabytes of data?

I'm receiving a file as a stream of byte[] data packets (total size isn't known in advance) that I need to store somewhere before processing it immediately after it's been received (I can't do the processing on the fly). Total received file size can vary from as small as 10 KB to over 4 GB.
One option for storing the received data is to use a MemoryStream, i.e. a sequence of MemoryStream.Write(bufferReceived, 0, count) calls to store the received packets. This is very simple, but obviously will result in an out of memory exception for large files.
An alternative option is to use a FileStream, i.e. FileStream.Write(bufferReceived, 0, count). This way, no out of memory exceptions will occur, but what I'm unsure about is bad performance due to disk writes (which I don't want to occur as long as plenty of memory is still available) - I'd like to avoid disk access as much as possible, but I don't know of a way to control this.
I did some testing and most of the time there seems to be little performance difference between, say, 10,000 consecutive calls of MemoryStream.Write() vs FileStream.Write(), but a lot seems to depend on the buffer size and the total amount of data in question (i.e. the number of writes). Obviously, MemoryStream size reallocation is also a factor.
Does it make sense to use a combination of MemoryStream and FileStream, i.e. write to memory stream by default, but once the total amount of data received is over e.g. 500 MB, write it to FileStream; then, read in chunks from both streams for processing the received data (first process 500 MB from the MemoryStream, dispose it, then read from FileStream)?
Another solution is to use a custom memory stream implementation that doesn't require continuous address space for internal array allocation (i.e. a linked list of memory streams); this way, at least on 64-bit environments, out of memory exceptions should no longer be an issue. Con: extra work, more room for mistakes.
So how do FileStream and MemoryStream reads/writes behave in terms of disk access and memory caching, i.e. the data size/performance balance? I would expect that as long as enough RAM is available, FileStream would effectively read/write from memory (the cache) anyway, and virtual memory would take care of the rest. But I don't know how often FileStream will explicitly access the disk when being written to.
Any help would be appreciated.
No, trying to optimize this doesn't make any sense. Windows itself already caches file writes; they are buffered by the file system cache. So your test is about right: both MemoryStream.Write() and FileStream.Write() actually write to RAM and have no significant perf difference. The file system driver lazily writes the data to disk in the background.
The RAM used for the file system cache is what's left over after processes claimed their RAM needs. By using a MemoryStream, you reduce the effectiveness of the file system cache. Or in other words, you trade one for the other without benefit. You're in fact off worse, you use double the amount of RAM.
Custom schemes like these don't help; this is already heavily optimized inside the operating system.
Since recent versions of Windows enable write caching by default, I'd say you could simply use FileStream and let Windows manage when or if anything actually is written to the physical hard drive.
If these files don't stick around after you've received them, you should probably write the files to a temp directory and delete them when you're done with them.
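If you take the temp-file route, a minimal sketch might look like the following; the method name, parameters and the 64 KB buffer size are my own choices rather than anything prescribed.

using System;
using System.Collections.Generic;
using System.IO;

static class ReceiveSpooler
{
    // Spools incoming packets to a temp file through a 64 KB FileStream buffer,
    // hands the file back for processing, then deletes it.
    public static void SpoolAndProcess(IEnumerable<byte[]> packets, Action<Stream> process)
    {
        string tempPath = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        try
        {
            using (var spool = new FileStream(tempPath, FileMode.CreateNew,
                                              FileAccess.Write, FileShare.None, 64 * 1024))
            {
                foreach (byte[] packet in packets)
                    spool.Write(packet, 0, packet.Length);
            }

            using (var input = new FileStream(tempPath, FileMode.Open,
                                              FileAccess.Read, FileShare.Read, 64 * 1024))
            {
                process(input);
            }
        }
        finally
        {
            File.Delete(tempPath);
        }
    }
}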
Use a FileStream constructor that allows you to define the buffer size. For example:
using (var outputFile = new FileStream("filename", FileMode.Create, FileAccess.Write,
                                       FileShare.None, 65536))
{
    // writes to outputFile now go through a 64 KB buffer
}
The default buffer size is 4K. Using a 64K buffer reduces the number of calls to the file system. A larger buffer will reduce the number of writes, but each write starts to take longer. Empirical data (many years of working with this stuff) indicates that 64K is a very good choice.
As somebody else pointed out, the file system will likely do further caching, and do the actual disk write in the background. It's highly unlikely that you'll receive data faster than you can write it to a FileStream.

difference between memory stream and filestream

During serialization we can use either a memory stream or a file stream.
What is the basic difference between these two? What does a memory stream mean?
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Runtime.Serialization.Formatters.Binary;

namespace Serilization
{
    class Program
    {
        static void Main(string[] args)
        {
            // 'person' is assumed to be an instance of a [Serializable] type defined elsewhere
            MemoryStream aStream = new MemoryStream();
            BinaryFormatter aBinaryFormat = new BinaryFormatter();
            aBinaryFormat.Serialize(aStream, person);
            aStream.Close();
        }
    }
}
A stream is a representation of a sequence of bytes. Both these classes derive from the Stream class, which is abstract.
As the name suggests, a FileStream reads and writes to a file whereas a MemoryStream reads and writes to memory. So it relates to where the stream is stored.
Now it depends on how you plan to use them. For example, if you want to read binary data from a database, you would go for a MemoryStream. However, if you want to read a file on your system, you would go for a FileStream.
One quick advantage of a MemoryStream is that there is no need to create temporary buffers or files in an application.
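To make the contrast concrete, here is a small sketch that serializes the same object to both destinations, mirroring the question's BinaryFormatter usage; the Person type and file name stand in for whatever is being serialized.

using System;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

// Illustrative stand-in for the object being serialized.
[Serializable]
class Person { public string Name { get; set; } }

class Demo
{
    static void Main()
    {
        var person = new Person { Name = "Example" };
        var formatter = new BinaryFormatter();

        // MemoryStream: the serialized bytes live only in RAM.
        byte[] inMemory;
        using (var ms = new MemoryStream())
        {
            formatter.Serialize(ms, person);
            inMemory = ms.ToArray();   // e.g. to hand to a database blob column
        }

        // FileStream: the serialized bytes land in a file on disk.
        using (var fs = new FileStream("person.bin", FileMode.Create, FileAccess.Write))
        {
            formatter.Serialize(fs, person);
        }
    }
}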
The other answers here are great, but I thought one that takes a really high-level look at what purpose streams serve might be useful. There's a bit of simplification going on in the explanation below, but hopefully this gets the idea across:
What is a stream?
A stream is effectively the flow of data between two places, it's the pipe rather than the contents of that pipe.
A bad analogy to start
Imagine a water desalination plant (something that takes seawater, removes the salt and outputs clean drinking water to the water network):
The desalination plant can't remove the salt from all of the sea at one time (and nor would we want it to… where would the saltwater fish live?), so instead we have:
A SeaStream that sucks a set amount of water at a time into the plant.
That SeaStream is connected to the DesalinationStream to remove the salt
And the output of the DesalinationStream is connected to the DrinkingWaterNetworkStream to output the now saltless water to the drinking water supply.
OK, so what's that got to do with computers?
Moving big files all at once can be problematic
Frequently in computing we want to move data between two locations, e.g. from an external hard drive to a binary field in a database (to use the example given in another answer). We can do that by copying all of the data from the file at location A into the computer's memory and from there to location B, but if the file is large, or the source or destination are potentially unreliable, then moving the whole file at once may be either unfeasible or unwise.
For example, say we want to move a large file on a USB stick to a field in a database. We could use a 'System.IO.File' object to retrieve that whole file into the computer's memory and then use a database connection to pass that file onto the database.
But, that's potentially problematic, what if the file is larger than the computer's available RAM? Now the file will potentially be cached to the hard drive, which is slow, and it might even slow the computer down too.
Likewise, what if the data source is unreliable, e.g. copying a file from a network drive with a slow and flaky WiFi connection? Trying to copy a large file in one go can be infuriating because you get half the file and then the connection drops out and you have to start all over again, only for it to potentially fail again.
It can be better to split the file and move it a piece at a time
So, rather than getting the whole file at once, it would be better to retrieve the file a piece at a time and pass each piece on to the destination one at a time. This is what a Stream does, and that's where the two different types of stream you mentioned come in:
We can use a FileStream to retrieve data from a file a piece at a time
and the database API may make available a MemoryStream endpoint we can write to a piece at a time.
We connect those two 'pipes' together to flow the file pieces from file to database.
Even if the file wasn't too big to be held in RAM, without streams we would still be doing a number of read/write operations that we didn't need to. The stages we were carrying out were:
Retrieving the data from the disk (slow)
Writing to a File object in the computer's memory (a bit faster)
Reading from that File object in the computer's memory (faster again)
Writing to the database (probably slow as there's probably a spinning disk hard-drive at the end of that pipe)
Streams allow us to conceptually do away with the middle two stages: instead of dragging the whole file into the computer's memory at once, we take the output of the operation that retrieves the data and pipe it straight to the operation that passes the data on to the database.
Other benefits of streams
Separating the retrieval of the data from the writing of the data like this also allows us to perform actions between retrieving the data and passing it on. For example, we could add an encryption stage, or we could write the incoming data to more than one type of output stream (e.g. to a FileStream and a NetworkStream).
Streams also allow us to write code where we can resume the operation should the transfer fail part way through. By keeping track of the number of pieces we've moved, if the transfer fails (e.g. if the network connection drops out) we can restart the Stream from the point at which we received the last piece (this is the offset in the BeginRead method).
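A minimal sketch of that piping idea, with a MemoryStream standing in for the database endpoint and an invented file path:

using System.IO;

// Sketch: pull the source a piece at a time and push each piece into the
// destination, never holding the whole file in memory at once.
using (var source = new FileStream(@"D:\usb\big-file.bin", FileMode.Open, FileAccess.Read))
using (var destination = new MemoryStream())
{
    byte[] piece = new byte[81920];        // move 80 KB at a time
    int read;
    while ((read = source.Read(piece, 0, piece.Length)) > 0)
    {
        destination.Write(piece, 0, read);
        // an encryption or hashing stage could be slotted in right here
    }
}
// Stream.CopyTo does essentially this loop for you.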
In simplest form, a MemoryStream writes data to memory, while a FileStream writes data to a file.
Typically, I use a MemoryStream if I need a stream, but I don't want anything to hit the disk, and I use a FileStream when writing a file to disk.
While a file stream reads from a file, a memory stream can be used to read data mapped in the computer's internal memory (RAM). You are basically reading/writing streams of bytes from memory.
Having had a bitter experience on the subject, here's what I've found: if performance is required, you should copy the contents of a FileStream to a MemoryStream. I had to process the contents of 144 files, 528 KB each, and present the outcome to the user. It took approximately 250 seconds (!). When I just copied the contents of each FileStream to a MemoryStream (with the CopyTo method), without changing anything else, the time dropped to approximately 32 seconds. Note that each time you copy one stream to another, the data is appended at the end of the destination stream, so you may need to 'rewind' the destination before reading it back. Hope it helps.
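A minimal sketch of that approach (the path is made up):

using System.IO;

// Sketch: load the file into a MemoryStream once, rewind it, then run the
// repeated processing against memory instead of the disk.
using (var fileStream = File.OpenRead(@"E:\records\file001.dat"))
using (var memoryStream = new MemoryStream())
{
    fileStream.CopyTo(memoryStream);   // appends at the destination's current position
    memoryStream.Position = 0;         // rewind before reading it back
    // ... process memoryStream here ...
}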
As for the stream itself: in general, it means that when you pull content through a stream, the stream does not load the entire data source (file, database, ...) into memory, as opposed to, for example, arrays or buffers, where you feed everything into memory up front. With a stream, you get one chunk of the file into memory at a time; when you reach the end of a chunk, the stream fetches the next chunk from the file. It all happens at a low level in the background while you simply iterate over the stream. That's why it's called a stream.
A memory stream handles data via an in-memory buffer. A FileStream deals with files on disk.
Serializing objects only to memory is hardly useful, in my opinion. You need to serialize an object when you want to save it to disk. Typically, serialization is done from the object (which is in memory) to the disk, while deserialization is done from the saved serialized object (on disk) back to the object (in memory).
So, most of the time, you want to serialize to disk, and thus you use a FileStream for serialization.

Larger File Streams using C#

There are some text files (records) which I need to access using C#.NET. The trouble is that those files are larger than 1 GB (the minimum size is 1 GB).
What do I need to do?
What are the factors I need to concentrate on?
Can someone give me an idea of how to overcome this situation?
EDIT:
Thanks for the fast responses. Yes, they are fixed-length records. These text files come from a local company (their last month's transaction records).
Is it possible to access these files like normal text files (using a normal file stream)?
And how about memory management?
Expanding on CasperOne's answer
Simply put, there is no way to reliably put a 100 GB file into memory at one time. On a 32-bit machine there is simply not enough addressing space. On a 64-bit machine there is enough addressing space, but in the time it would take to actually get the file into memory, your user will have killed your process out of frustration.
The trick is to process the file incrementally. The base System.IO.Stream class is designed to process a variable (and possibly infinite) stream in distinct quantities. It has several Read methods that only progress down the stream a specific number of bytes. You will need to use these methods to divide up the stream.
I can't give more information because your scenario is not specific enough. Can you give us more details on your record delimiters, or some sample lines from the file?
Update
If they are fixed length records then System.IO.Stream will work just fine. You can even use File.Open() to get access to the underlying Stream object. Stream.Read has an overload that requests the number of bytes to be read from the file. Since they are fixed length records this should work well for your scenario.
As long as you don't call ReadAllText() and instead use the Stream.Read() methods which take explicit byte arrays, memory won't be an issue. The underlying Stream class will take care not to put the entire file into memory (that is of course, unless you ask it to :) ).
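To make that concrete, here is a hedged sketch for fixed-length text records; the 100-byte record length, the encoding and the file path are assumptions.

using System.IO;
using System.Text;

class FixedLengthReader
{
    // Sketch: read fixed-length text records one at a time so memory use
    // stays constant regardless of the file size.
    const int RecordLength = 100;

    static void Main()
    {
        byte[] record = new byte[RecordLength];
        using (Stream stream = File.Open(@"E:\records\transactions.txt", FileMode.Open))
        {
            while (FillBuffer(stream, record) == record.Length)
            {
                string text = Encoding.ASCII.GetString(record);
                // ... process one record here ...
            }
        }
    }

    // Stream.Read may return fewer bytes than asked for, so keep reading
    // until the record buffer is full or the file ends.
    static int FillBuffer(Stream stream, byte[] buffer)
    {
        int total = 0;
        int read;
        while (total < buffer.Length &&
               (read = stream.Read(buffer, total, buffer.Length - total)) > 0)
        {
            total += read;
        }
        return total;
    }
}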
You aren't specifically listing the problems you need to overcome. A file can be 100GB and you can have no problems processing it.
If you have to process the file as a whole then that is going to require some creative coding, but if you can simply process sections of the file at a time, then it is relatively easy to move to the location in the file you need to start from, process the data you need to process in chunks, and then close the file.
More information here would certainly be helpful.
What are the main problems you are having at the moment? The big thing to remember is to think in terms of streams - i.e. keep the minimum amount of data in memory that you can. LINQ is excellent at working with sequences (although there are some buffering operations you need to avoid, such as OrderBy).
For example, here's a way of handling simple records from a large file efficiently (note the iterator block).
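The linked example isn't reproduced here, but a sketch of what such an iterator block typically looks like is below; the type name and file path are illustrative.

using System.Collections.Generic;
using System.IO;

static class RecordReader
{
    // Sketch of an iterator block: records are yielded one at a time, so only
    // the current line is in memory while a LINQ query runs over the sequence.
    public static IEnumerable<string> ReadRecords(string path)
    {
        using (var reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                yield return line;
            }
        }
    }
}

// Usage (with System.Linq): stream the file through a query without loading it all.
// int bigOnes = RecordReader.ReadRecords(@"E:\records\transactions.txt")
//                           .Count(line => line.Length > 80);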
For performing multiple aggregates/analysis over large data from files, consider Push LINQ in MiscUtil.
Can you add more context to the problems you are thinking of?
Expanding on JaredPar's answer.
If the file is a binary file (i.e. ints stored as 4 bytes, fixed-length strings, etc.) you can use the BinaryReader class. That's easier than pulling out n bytes and then trying to interrogate them.
Also note that the Read method on System.IO.Stream does not guarantee to fill your buffer: if you ask for 100 bytes it may return fewer than that without having reached the end of the file.
The BinaryReader.ReadBytes method will block until it reads the requested number of bytes or reaches end of file, whichever comes first.
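A small sketch of that, with an invented record layout (an int, a long and a 20-byte name field) purely for illustration:

using System.IO;

// Sketch: pull typed values straight out of a binary record file.
using (var reader = new BinaryReader(File.OpenRead(@"E:\records\transactions.dat")))
{
    while (reader.BaseStream.Position < reader.BaseStream.Length)
    {
        int id = reader.ReadInt32();
        long amountInCents = reader.ReadInt64();
        byte[] name = reader.ReadBytes(20);   // blocks until 20 bytes or end of file
        // ... process the record ...
    }
}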
Nice collaboration lads :)
Hey Guys, I realize that this post hasn't been touched in a while, but I just wanted to post a site that has the solution to your problem.
http://thedeveloperpage.wordpress.com/c-articles/using-file-streams-to-write-any-size-file-introduction/
Hope it helps!
-CJ
