Getting the file size of a file in C#

I want to write a file back to the client using a custom handler in ASP.NET and I'm wondering what the best way is to do this with the least processing time. At the moment I have two different versions of the code that do the same thing, but because the handler will be used a lot, I'm wondering which is the most efficient.
Load the complete file to a byte array and use BinaryWrite to write the file:
string filePath = context.Server.MapPath(context.Request.Url.LocalPath);
Byte[] swfFile = File.ReadAllBytes(filePath);
context.Response.AppendHeader("content-length", Utils.MakeString(swfFile.Length));
context.Response.ContentType = Utils.GetMimeType(Path.GetExtension(filePath));
context.Response.BinaryWrite(swfFile);
Using a FileInfo object to determine the file length and TransmitFile to write the file:
string filePath = context.Server.MapPath(context.Request.Url.LocalPath);
FileInfo fileInfo = new FileInfo(filePath);
context.Response.AppendHeader("content-length", Utils.MakeString(fileInfo.Length));
context.Response.ContentType = Utils.GetMimeType(Path.GetExtension(filePath));
context.Response.TransmitFile(filePath);
I would suspect the TransmitFile method is the most efficient because it writes the file without buffering it. But what about the FileInfo object? How does it calculate the file size? And is a FileInfo object the best way to do this, or is there a better way?

FileInfo asks the filesystem for the file size information (it does not need to read all of the contents to do so of course). This is usually considered an "expensive" operation (in comparison to manipulating stuff in memory and calling methods) because it hits the disk.
However, while it's true that cutting down on disk accesses is a good thing, when you are preparing to read the full contents of a file anyway this won't make a difference in the grand scheme of things. So the performance of FileInfo itself is not what you should be focusing on.
The #1 performance issue here is that the first approach keeps the whole file in memory for as long as the client takes to download it -- this can be a huge problem, as (depending on the size of the files and the throughput of the client connection) it has the potential of massively increasing the memory usage of your application. And if this increased memory usage leads to swapping (i.e. hitting the disk) performance will instantly tank.
So what you should do is use TransmitFile -- not because it's faster when going to the disk (it may or may not be), but because it uses less memory.

Related

C# Reuse StreamWriter or FileStream but change destination file

A little background...
Everything I'm about to describe, up to my implementation of the StreamWriter, is a business process which I cannot change.
Every month I pull around 200 different tables of data into individual files.
Each file contains roughly 400,000 lines of business logic details for upwards of 5,000-6,000 different business units.
To effectively use that data with the tools on hand, I have to break down those files into individual files for each business unit...
200 files, each split by business unit, comes out to roughly 100,000 different files.
The way I've BEEN doing it is the typical StreamWriter loop...
foreach (string SplitFile in businessFiles)            // businessFiles: the List<string> of output file paths
{
    using (StreamWriter SW = new StreamWriter(SplitFile))
    {
        foreach (var BL in linesForThisFile)           // the source lines belonging to this business unit
        {
            string[] Split1 = BL.Split(',');
            SW.WriteLine("{0,-8}{1,-8}{2,-8}{3,-8}{4,-8}{5,-8}{6,-8}{7,-8}{8,-8}{9,-16}{10,-1}",
                Split1[0], Split1[1], Split1[2], Split1[3], Split1[4], Split1[5], Split1[6], Split1[7],
                Split1[8], Convert.ToDateTime(Split1[9]).ToString("dd-MMM-yyyy"), Split1[10]);
        }
    }
}
The issue with this is that it takes an excessive amount of time.
It can sometimes take 20 minutes to process all the files.
Profiling my code shows me that 98% of the time spent is on the system disposing of the StreamWriter after the program leaves the loop.
So my question is:
Is there a way to keep the underlying Stream open and reuse it to write a different file?
I know I can Flush() the Stream, but I can't figure out how to get it to start writing to another file altogether. I can't seem to find a way to change the destination filename without creating another StreamWriter.
Edit:
A picture of what it shows when I profile the code
Edit 2:
So after poking around a bit more I started looking at it a different way.
First, I already had the reading of the one file and the writing of the huge number of smaller files in a nested parallel loop, so I was essentially maxing out my I/O as it is.
I'm also writing to an SSD, so all those were good points.
Turns out I'm reading the 1 massive file and writing ~5600 smaller ones every 90 seconds or so.
That's 60 files a second. I guess I can't really ask for much more than that.
This sounds about right. 100,000 files in 20 minutes is more than 83 files every second. Disk I/O is pretty much the slowest thing you can do within a single computer. All that time in the Dispose() method is waiting for the buffer to flush out to disk while closing the file... it's the actual time writing the data to your persistent storage, and a separate using block for each file is the right way to make sure this is done safely.
To speed this up it's tempting to look at asynchronous processing (async/await), but I don't think you'll find any gains there; ultimately this is an I/O-bound task, so optimizing for your CPU scheduling might even make things worse. Better gains could be available if you can change the output to write into a single (indexed) file, so the operating system's disk buffering mechanism can be more efficient.
I would agree with Joel that the time is mostly due to writing the data out to disk. I would however be a little bit more optimistic about doing parallel IO, since SSDs are better able to handle higher loads than regular HDDs. So I would try a few things:
1. Doing stuff in parallel
Change your outer loop to a parallel one
Parallel.ForEach(
    myBusinessFiles,
    new ParallelOptions() { MaxDegreeOfParallelism = 2 },
    SplitFile =>
    {
        // Loop body
    });
Try changing the degree of parallelism to see if performance improves or not. This assumes the data is thread safe.
2. Try writing to a high-speed local SSD
I'm assuming you are writing to a network folder; this adds some additional latency, so you might try writing to a local disk instead. If you are already doing that, consider getting a faster disk. If you need to move all the files to a network drive afterwards you will likely not gain anything overall, but it can give you an idea of the penalty the network adds.
3. Try writing to a Zip Archive
A zip archive can contain multiple files while still allowing fairly easy access to an individual file. This could help improve performance in a few ways:
Compression. I would assume your data is fairly easy to compress, so you would write less data overall.
Fewer file system operations. Since you are only writing to a single file, you avoid some file system overhead.
Reduced overhead due to cluster size. Files have a minimum on-disk size (one cluster), which can waste a fair bit of space for small files. Using an archive avoids this.
You could also try saving each file in an individual zip-archive, but then you would mostly benefit from the compression.
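As a rough illustration of the archive idea, here is a minimal sketch using System.IO.Compression (businessFiles and the LinesFor helper are placeholders for the asker's own data, not their actual code):
using System.IO;
using System.IO.Compression;

// Write every per-unit "file" as an entry inside a single zip archive.
using (var zipStream = new FileStream("output.zip", FileMode.Create))
using (var archive = new ZipArchive(zipStream, ZipArchiveMode.Create))
{
    foreach (string splitFile in businessFiles)              // placeholder: per-unit file names
    {
        ZipArchiveEntry entry = archive.CreateEntry(splitFile, CompressionLevel.Fastest);
        using (var writer = new StreamWriter(entry.Open()))
        {
            foreach (string line in LinesFor(splitFile))     // placeholder: lines for this unit
                writer.WriteLine(line);
        }
    }
}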
Responding to your question: you do have an option (a flag on the constructor), but it is strongly tied to the garbage collector, and in a multi-threaded environment it could become a mess. That said, this is the overloaded constructor:
StreamWriter(Stream, Encoding, Int32, Boolean)
Initializes a new instance of the StreamWriter class for the specified stream by using the specified encoding and buffer size, and optionally leaves the stream open.
public StreamWriter (System.IO.Stream stream, System.Text.Encoding? encoding = default, int bufferSize = -1, bool leaveOpen = false);
Source
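A minimal sketch of how that overload behaves (the file name is a placeholder; note that leaveOpen only keeps the underlying stream open when the writer is disposed, it does not let one stream switch to a different file):
using System.IO;
using System.Text;

using (FileStream fs = new FileStream("output.txt", FileMode.Create))
{
    // leaveOpen: true -> disposing the writer flushes it but does not close fs.
    using (var writer = new StreamWriter(fs, Encoding.UTF8, bufferSize: 65536, leaveOpen: true))
    {
        writer.WriteLine("first batch");
    }

    // The underlying stream is still open and can be handed to another writer.
    using (var writer = new StreamWriter(fs, Encoding.UTF8, bufferSize: 65536, leaveOpen: true))
    {
        writer.WriteLine("second batch");
    }
}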

Memory Stream vs File Stream for static content download

I have a scenario where every uploaded file, which may be of any MIME type, should be encrypted, and when a user wants to download it, it should be decrypted.
For this, I decrypt the requested file and save it in a temporary location.
My decryption method reads the encrypted filestream and writes the decrypted data to another filestream.
Now, should I change my algorithm so that the decrypted data goes to a memory stream
and is downloaded directly from that memory stream, instead of being written to a filestream and downloaded as a file?
In terms of performance, which would be better in this case: filestream or memory stream?
I am thinking that if multiple huge files are requested by multiple users (let's say 100 different files are requested by 100 different users), memory may run out and we may face some unwanted trouble.
Which one should I implement?
Have you considered streaming directly into the NetworkStream for output? You will need no memory and no disk space for that. If you expect higher performance, you can buffer the data in a BufferedStream as an intermediate step.
If direct or buffered streaming is not an option for you, the correct answer depends on your requirements. You need to know the size of your files and the expected number of requests. If you think your web server's memory can handle it and you need the added performance, go for it. Otherwise you should buffer on the hard drive. Remember: you only need sufficient performance, not perfect performance.
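A rough sketch of the direct-streaming idea in a classic ASP.NET handler (encryptedPath, key and iv are placeholders; this assumes AES was used for the encryption, which the question doesn't specify):
using System.IO;
using System.Security.Cryptography;

// Decrypt the stored file and copy it straight to the client,
// without buffering the whole file in memory or writing a temp file.
using (FileStream encrypted = File.OpenRead(encryptedPath))   // placeholder path
using (Aes aes = Aes.Create())
{
    aes.Key = key;                                             // placeholder: from your key store
    aes.IV = iv;
    using (var crypto = new CryptoStream(encrypted, aes.CreateDecryptor(), CryptoStreamMode.Read))
    {
        context.Response.ContentType = "application/octet-stream";
        crypto.CopyTo(context.Response.OutputStream, 81920);   // stream in ~80 KB chunks
    }
}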

Combining FileStream and MemoryStream to avoid disk accesses/paging while receiving gigabytes of data?

I'm receiving a file as a stream of byte[] data packets (total size isn't known in advance) that I need to store somewhere before processing it immediately after it's been received (I can't do the processing on the fly). Total received file size can vary from as small as 10 KB to over 4 GB.
One option for storing the received data is to use a MemoryStream, i.e. a sequence of MemoryStream.Write(bufferReceived, 0, count) calls to store the received packets. This is very simple, but obviously will result in an out-of-memory exception for large files.
An alternative option is to use a FileStream, i.e. FileStream.Write(bufferReceived, 0, count). This way, no out of memory exceptions will occur, but what I'm unsure about is bad performance due to disk writes (which I don't want to occur as long as plenty of memory is still available) - I'd like to avoid disk access as much as possible, but I don't know of a way to control this.
I did some testing and most of the time there seems to be little performance difference between, say, 10,000 consecutive calls of MemoryStream.Write() vs FileStream.Write(), but a lot seems to depend on buffer size and the total amount of data in question (i.e. the number of writes). Obviously, MemoryStream size reallocation is also a factor.
Does it make sense to use a combination of MemoryStream and FileStream, i.e. write to memory stream by default, but once the total amount of data received is over e.g. 500 MB, write it to FileStream; then, read in chunks from both streams for processing the received data (first process 500 MB from the MemoryStream, dispose it, then read from FileStream)?
Another solution is to use a custom memory stream implementation that doesn't require continuous address space for internal array allocation (i.e. a linked list of memory streams); this way, at least on 64-bit environments, out of memory exceptions should no longer be an issue. Con: extra work, more room for mistakes.
So how do FileStream and MemoryStream reads/writes behave in terms of disk access and memory caching, i.e. the data size/performance balance? I would expect that as long as enough RAM is available, FileStream would internally read/write from memory (the cache) anyway, and virtual memory would take care of the rest. But I don't know how often FileStream will explicitly access the disk when being written to.
Any help would be appreciated.
No, trying to optimize this doesn't make any sense. Windows itself already caches file writes; they are buffered by the file system cache. So your test is about right: both MemoryStream.Write() and FileStream.Write() actually write to RAM and have no significant performance difference. The file system driver lazily writes the data to disk in the background.
The RAM used for the file system cache is what's left over after processes have claimed their RAM needs. By using a MemoryStream, you reduce the effectiveness of the file system cache. In other words, you trade one for the other without benefit. You're in fact worse off: you use double the amount of RAM.
Don't try to help; this is already heavily optimized inside the operating system.
Since recent versions of Windows enable write caching by default, I'd say you could simply use FileStream and let Windows manage when or if anything actually is written to the physical hard drive.
If these files don't stick around after you've received them, you should probably write the files to a temp directory and delete them when you're done with them.
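For example, a minimal sketch of a temp file that removes itself when its stream is closed (bufferReceived and count are the question's variables; the rest is assumed):
using System.IO;

string tempPath = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());

// DeleteOnClose tells the OS to remove the file as soon as the stream is closed.
using (var temp = new FileStream(tempPath, FileMode.CreateNew, FileAccess.ReadWrite,
                                 FileShare.None, 65536, FileOptions.DeleteOnClose))
{
    temp.Write(bufferReceived, 0, count);   // repeated for every received packet
    temp.Position = 0;                      // rewind before processing
    // ... process the received data here, before the stream is disposed ...
}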
Use a FileStream constructor that allows you to define the buffer size. For example:
using (var outputFile = new FileStream("filename",
    FileMode.Create, FileAccess.Write, FileShare.None, 65536))
{
    // writes here are buffered in a 64 KB buffer before going to the file system
}
The default buffer size is 4K. Using a 64K buffer reduces the number of calls to the file system. A larger buffer will reduce the number of writes, but each write starts to take longer. Empirical data (many years of working with this stuff) indicates that 64K is a very good choice.
As somebody else pointed out, the file system will likely do further caching, and do the actual disk write in the background. It's highly unlikely that you'll receive data faster than you can write it to a FileStream.

What is the best way to work with files in memory in C#?

I am building an ASP.NET web application that creates PowerPoint presentations on the fly. I have the basics working but it creates actual physical files on the hard disk. That doesn't seem like a good idea for a large multi-user web application. It seems like it would be better if the application created the presentations in memory and then streamed them back to the user. Instead of manipulating files should I be working with the MemoryStream class? I am not exactly sure I understand the difference between working with Files and working with Streams. Are they sort of interchangeable? Can anyone point me to a good resource for doing file type operations in memory instead of on disk? I hope I have described this well enough.
Corey
You are trying to make a decision that you think impacts the performance of your application based on a "doesn't seem like a good idea" measurement, which is barely scientific. It would be better to implement both and compare, but first you should list your concerns about either implementation.
Here are some ideas to start:
There is really not much difference between temporary files and in-memory streams. Both will have their content in physical memory if they are small enough, and both will hit the disk if there is memory pressure. Consider using temporary delete-on-close files if cleaning the files up is your main concern.
The OS is already doing a very good job of managing large files with caching; a pure in-memory solution would need to at least match that.
MemoryStream is not the best implementation for reasonably sized streams due to its "all data is in a single byte array" contract (see my answer at https://stackoverflow.com/a/10424137/477420).
Managing multiple large in-memory streams (i.e. for multiple users) is "fun" on the x86 platform and less of a concern on x64.
Some APIs simply don't provide a way to work with Stream-based classes and require a physical file.
Files and streams are similar, yes. Both essentially stream a byte array...one from memory, one from the hard drive. If the API you are using allows you to generate a stream, then you can easily do that and serve it out to the user using the Response object.
The following code will take a PowerPoint memory object (you'll need to modify it for your own API, but you can get the general idea), save it to a MemoryStream, then set the proper headers and write the stream to the Response (which will then let the user save the file to their local computer):
SaveFormat format = SaveFormat.PowerPoint2007;
Slideshow show = PowerPointWriter.Generate(report, format);
MemoryStream ms = new MemoryStream();
show.Save(ms, format);
Response.Clear();
Response.Buffer = true;
Response.ContentType = "application/vnd.ms-powerpoint";
Response.AddHeader("Content-Disposition", "attachment; filename=\"Slideshow.ppt\"");
Response.BinaryWrite(ms.ToArray());
Response.End();
Yes, I would recommend the MemoryStream. Typically any time you access a file, you are doing so with a stream. There are many kinds of streams (e.g. network streams, file streams, and memory streams) and they all implement the same basic interface. If you are already creating the file in a file stream, instead of something like a string or byte array, then it should require very little coding changes to switch to a MemoryStream.
Basically, a stream is simply a way of working with large amounts of data where you don't have to, or can't, load all the data into memory at once. So, rather than reading or writing the entire set of data into a giant array or something, you open a stream, which gives you the equivalent of a cursor. You can move your current position to any spot in the stream and read or write from that point.
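A tiny sketch of that "cursor" idea (the file name is a placeholder):
using System.IO;

using (var fs = new FileStream("data.bin", FileMode.Open, FileAccess.Read))
{
    fs.Seek(1024, SeekOrigin.Begin);                 // move the cursor 1 KB into the file
    var buffer = new byte[256];
    int read = fs.Read(buffer, 0, buffer.Length);    // read up to 256 bytes from that point
}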

difference between memory stream and filestream

During serialization we can use either a memory stream or a file stream.
What is the basic difference between these two? What does a memory stream mean?
using System;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

namespace Serilization
{
    [Serializable]
    class Person
    {
        public string Name;
    }

    class Program
    {
        static void Main(string[] args)
        {
            // An example object to serialize (the original snippet left "person" undefined).
            var person = new Person { Name = "Example" };

            MemoryStream aStream = new MemoryStream();
            BinaryFormatter aBinaryFormat = new BinaryFormatter();
            aBinaryFormat.Serialize(aStream, person);
            aStream.Close();
        }
    }
}
Stream is a representation of bytes. Both these classes derive from the Stream class which is abstract by definition.
As the name suggests, a FileStream reads and writes to a file whereas a MemoryStream reads and writes to the memory. So it relates to where the stream is stored.
Now it depends how you plan to use both of these. For eg: Let us assume you want to read binary data from the database, you would go in for a MemoryStream. However if you want to read a file on your system, you would go in for a FileStream.
One quick advantage of a MemoryStream is that there is no need to create temporary buffers and files in an application.
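For example, binary data pulled from a database can be wrapped in a MemoryStream directly, with no temp file involved (GetBlobFromDatabase is a hypothetical helper):
using System.IO;

byte[] blob = GetBlobFromDatabase();      // hypothetical helper returning the column's bytes
using (var ms = new MemoryStream(blob))
{
    // The stream reads directly over the existing array; nothing is written to disk.
    using (var reader = new BinaryReader(ms))
    {
        int header = reader.ReadInt32();  // example: read the first four bytes
    }
}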
The other answers here are great, but I thought one that takes a really high-level look at what purpose streams serve might be useful. There's a bit of simplification going on in the explanation below, but hopefully this gets the idea across:
What is a stream?
A stream is effectively the flow of data between two places; it's the pipe rather than the contents of that pipe.
A bad analogy to start
Imagine a water desalination plant (something that takes seawater, removes the salt and outputs clean drinking water to the water network):
The desalination plant can't remove the salt from all of the sea at one time (and nor would we want it to… where would the saltwater fish live?), so instead we have:
A SeaStream that sucks a set amount of water at a time into the plant.
That SeaStream is connected to the DesalinationStream to remove the salt
And the output of the DesalinationStream is connected to the DrinkingWaterNetworkStream to output the now saltless water to the drinking water supply.
OK, so what's that got to do with computers?
Moving big files all at once can be problematic
Frequently in computing we want to move data between two locations, e.g. from an external hard drive to a binary field in a database (to use the example given in another answer). We can do that by copying all of the data from the file from location A into the computer's memory and from there to location B, but if the file is large, or the source or destination are potentially unreliable, then moving the whole file at once may be either unfeasible or unwise.
For example, say we want to move a large file on a USB stick to a field in a database. We could use a 'System.IO.File' object to retrieve that whole file into the computer's memory and then use a database connection to pass that file onto the database.
But, that's potentially problematic, what if the file is larger than the computer's available RAM? Now the file will potentially be cached to the hard drive, which is slow, and it might even slow the computer down too.
Likewise, what if the data source is unreliable, e.g. copying a file from a network drive with a slow and flaky WiFi connection? Trying to copy a large file in one go can be infuriating because you get half the file and then the connection drops out and you have to start all over again, only for it to potentially fail again.
It can be better to split the file and move it a piece at a time
So, rather than getting the whole file at once, it would be better to retrieve the file a piece at a time and pass each piece on to the destination one at a time. This is what a Stream does, and that's where the two different types of stream you mentioned come in:
We can use a FileStream to retrieve data from a file a piece at a time
and the database API may make available a MemoryStream endpoint we can write to a piece at a time.
We connect those two 'pipes' together to flow the file pieces from file to database.
Even if the file isn't too big to be held in RAM, without streams we are still doing a number of read/write operations that we don't need to. The stages we were carrying out were:
Retrieving the data from the disk (slow)
Writing to a File object in the computer's memory (a bit faster)
Reading from that File object in the computer's memory (faster again)
Writing to the database (probably slow as there's probably a spinning disk hard-drive at the end of that pipe)
Streams allow us to conceptually do away with the middle two stages: instead of dragging the whole file into memory at once, we take the output of the operation that retrieves the data and pipe it straight into the operation that passes the data on to the database.
Other benefits of streams
Separating the retrieval of the data from the writing of the data like this also allows us to perform actions between retrieving the data and passing it on. For example, we could add an encryption stage, or we could write the incoming data to more than one type of output stream (e.g. to a FileStream and a NetworkStream).
Streams also allow us to write code where we can resume the operation should the transfer fail part way through. By keeping track of the number of pieces we've moved, if the transfer fails (e.g. if the network connection drops out) we can restart the Stream from the point at which we received the last piece (for example, by seeking back to that offset before reading again).
In simplest form, a MemoryStream writes data to memory, while a FileStream writes data to a file.
Typically, I use a MemoryStream if I need a stream, but I don't want anything to hit the disk, and I use a FileStream when writing a file to disk.
While a file stream reads from a file, a memory stream can be used to read data mapped in the computer's internal memory (RAM). You are basically reading/writing streams of bytes from memory.
Having bitter experience on the subject, here's what I've found out: if performance is required, you should copy the contents of a filestream to a memorystream. I had to process the contents of 144 files of 528 KB each and present the outcome to the user. It took approximately 250 seconds (!). When I just copied the contents of each filestream to a memorystream (with the CopyTo method), without changing anything else at all, the time dropped to approximately 32 seconds. Note that each time you copy one stream to another, the data is appended at the current end of the destination stream, so you may need to 'rewind' it first. Hope it helps.
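A minimal sketch of that approach (the path variable is a placeholder), including the rewind before the copied data is read back:
using System.IO;

using (var file = File.OpenRead(path))    // placeholder path
using (var ms = new MemoryStream())
{
    file.CopyTo(ms);      // appends at the memory stream's current position (the end)
    ms.Position = 0;      // rewind so subsequent reads start at the beginning
    // ... process ms entirely in memory ...
}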
In regards to the stream itself: in general, it means that when you put content into the stream (memory), it does not pull the entire content of whatever data source you are working with (file, database, ...) into memory, as opposed to, for example, arrays or buffers, where you feed everything into memory. With a stream you get one chunk of, say, a file into memory, and when you reach the end of that chunk, the stream fetches the next chunk from the file. It all happens in the low-level background while you are just iterating the stream. That's why it's called a stream.
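A small sketch of that chunk-by-chunk behaviour written out by hand (the file name and the Process method are placeholders):
using System.IO;

using (var source = new FileStream("big.dat", FileMode.Open, FileAccess.Read))
{
    var buffer = new byte[81920];
    int read;
    while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
    {
        // Only one buffer's worth of the file is in memory at any time.
        Process(buffer, read);   // placeholder for whatever handles each chunk
    }
}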
A memory stream handles data via an in memory buffer. A filestream deals with files on disk.
Serializing objects in memory is hardly useful, in my opinion. You need to serialize an object when you want to save it to disk. Typically, serialization is done from the object (which is in memory) to the disk, while deserialization is done from the saved serialized object (on the disk) back to the object (in memory).
So, most of the time you want to serialize to disk, and thus you use a FileStream for serialization.
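A minimal sketch of the same serialization going to disk instead of memory (reusing the person object from the question's code; BinaryFormatter is shown only because the question uses it):
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

// Serialize straight to a file on disk rather than to a MemoryStream.
using (var fs = new FileStream("person.bin", FileMode.Create))
{
    new BinaryFormatter().Serialize(fs, person);
}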
