C# Reuse StreamWriter or FileStream but change destination file

A little background...
Everything I'm about to describe up to my implementation of the StreamWriter is business processes which I cannot change.
Every month I pull around 200 different tables of data into individual files.
Each file contains roughly 400,000 lines of business logic details for upwards of 5,000-6,000 different business units.
To effectively use that data with the tools on hand, I have to break down those files into individual files for each business unit...
200 files x 5000 business units per file = 100,000 different files.
The way I've BEEN doing it is the typical StreamWriter loop...
// BusinessFiles is a List<string> of output paths, one per business unit
foreach (string SplitFile in BusinessFiles)
{
    using (StreamWriter SW = new StreamWriter(SplitFile))
    {
        // g is the group of source lines belonging to this business unit
        foreach (var BL in g)
        {
            string[] Split1 = BL.Split(',');
            SW.WriteLine("{0,-8}{1,-8}{2,-8}{3,-8}{4,-8}{5,-8}{6,-8}{7,-8}{8,-8}{9,-16}{10,-1}",
                Split1[0], Split1[1], Split1[2], Split1[3], Split1[4], Split1[5], Split1[6], Split1[7], Split1[8],
                Convert.ToDateTime(Split1[9]).ToString("dd-MMM-yyyy"), Split1[10]);
        }
    }
}
The issue with this is that it takes an excessive amount of time. It can sometimes take 20 minutes to process all the files.
Profiling my code shows that 98% of the time is spent disposing of the StreamWriter after the program leaves the inner loop.
So my question is:
Is there a way to keep the underlying Stream open and reuse it to write a different file?
I know I can Flush() the Stream, but I can't figure out how to get it to start writing to another file altogether. I can't seem to find a way to change the destination filename without constructing another StreamWriter.
Edit:
A picture of what it shows when I profile the code
Edit 2:
So after poking around a bit more I started looking at it a different way.
First, I already had the reading of the one file and the writing of the huge number of smaller files in a nested parallel loop, so I was essentially maxing out my I/O as is.
I'm also writing to an SSD, so all those were good points.
It turns out I'm reading the one massive file and writing ~5,600 smaller ones every 90 seconds or so.
That's about 60 files a second. I guess I can't really ask for much more than that.

This sounds about right. 100,000 files in 20 minutes is more than 83 files every second. Disk I/O is pretty much the slowest thing you can do within a single computer. All that time in the Dispose() method is waiting for the buffer to flush out to disk while closing the file... it's the actual time writing the data to your persistent storage, and a separate using block for each file is the right way to make sure this is done safely.
To speed this up it's tempting to look at asynchronous processing (async/await), but I don't think you'll find any gains there; ultimately this is an I/O-bound task, so optimizing for your CPU scheduling might even make things worse. Better gains could be available if you can change the output to write into a single (indexed) file, so the operating system's disk buffering mechanism can be more efficient.
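For illustration, here is a minimal sketch of that single-indexed-file idea, assuming the split lines are already grouped per business unit in a dictionary (WriteIndexedFile and linesByUnit are hypothetical names): every unit's block is appended to one data file, and an (offset, length) index is kept so a single unit can be pulled back out later.
using System.Collections.Generic;
using System.IO;

// Writes every unit's lines into one data file and returns an index of
// (offset, length) per unit, so a single unit's block can be read back later.
static Dictionary<string, (long Offset, long Length)> WriteIndexedFile(
    string outputPath, IDictionary<string, List<string>> linesByUnit)
{
    var index = new Dictionary<string, (long Offset, long Length)>();

    using (var fs = new FileStream(outputPath, FileMode.Create, FileAccess.Write))
    using (var sw = new StreamWriter(fs))
    {
        foreach (var unit in linesByUnit)
        {
            sw.Flush();                      // make fs.Position reflect what has been written
            long start = fs.Position;

            foreach (string line in unit.Value)
                sw.WriteLine(line);

            sw.Flush();
            index[unit.Key] = (start, fs.Position - start);
        }
    }
    return index;
}
The index can be persisted alongside the data file and used to extract one unit's block later without re-splitting everything.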

I would agree with Joel that the time is mostly spent writing the data out to disk. I would, however, be a little more optimistic about doing parallel I/O, since SSDs are better able to handle higher loads than regular HDDs. So I would try a few things:
1. Doing stuff in parallel
Change your outer loop to a parallel one
Parallel.ForEach(
    myBusinessFiles,
    new ParallelOptions { MaxDegreeOfParallelism = 2 },
    SplitFile =>
    {
        // Loop body
    });
Try changing the degree of parallelism to see if performance improves or not. This assumes the data is thread safe.
2. Try writing to a high-speed local SSD
I'm assuming you are writing to a network folder, which adds some additional latency, so you might try writing to a local disk. If you are already doing that, consider getting a faster disk. If you need to move all the files to a network drive afterwards you will likely not gain anything, but it can give you an idea of the penalty you pay for the network.
3. Try writing to a Zip Archive
A zip archive can contain multiple files inside it while still allowing fairly easy access to an individual file. This could help improve performance in a few ways (a sketch follows below):
Compression. I would assume your data is fairly easy to compress, so you would write less data overall.
Fewer file system operations. Since you are only writing to a single file, you avoid some file system overhead.
Reduced overhead due to cluster size. Files have a minimum on-disk size, which can waste a fair bit of space for small files. Using an archive avoids this.
You could also try saving each file in an individual zip archive, but then you would mostly benefit from the compression.
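As a rough sketch of the zip-archive idea, using the framework's System.IO.Compression.ZipArchive and a hypothetical linesByUnit dictionary holding each unit's lines:
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

// Writes each business unit's lines as one entry inside a single .zip,
// so the file system only ever sees one output file.
static void WriteUnitsToZip(string zipPath, IDictionary<string, List<string>> linesByUnit)
{
    using (var zipStream = new FileStream(zipPath, FileMode.Create))
    using (var archive = new ZipArchive(zipStream, ZipArchiveMode.Create))
    {
        foreach (var unit in linesByUnit)
        {
            ZipArchiveEntry entry = archive.CreateEntry(unit.Key + ".txt", CompressionLevel.Fastest);
            using (var writer = new StreamWriter(entry.Open()))
            {
                foreach (string line in unit.Value)
                    writer.WriteLine(line);
            }
        }
    }
}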

Responding to your question: you do have an option (a flag on the constructor), but it is strongly tied to the garbage collector, and in a multi-threaded environment it could be a mess. That said, this is the overloaded constructor:
StreamWriter(Stream, Encoding, Int32, Boolean)
Initializes a new instance of the StreamWriter class for the specified stream by using the specified encoding and buffer size, and optionally leaves the stream open.
public StreamWriter (System.IO.Stream stream, System.Text.Encoding? encoding = default, int bufferSize = -1, bool leaveOpen = false);
Source
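For illustration, a minimal sketch of what leaveOpen buys you: disposing the StreamWriter flushes it without closing the underlying stream. Note that it does not let you point the same stream at a different file; for a new destination you still need a new FileStream.
using System.IO;
using System.Text;

// leaveOpen: true means disposing the StreamWriter flushes it but does NOT
// close the underlying stream, so the stream can be reused afterwards.
using (var stream = new FileStream("output.txt", FileMode.Create))
{
    using (var writer = new StreamWriter(stream, Encoding.UTF8, bufferSize: 4096, leaveOpen: true))
    {
        writer.WriteLine("first block");
    }   // writer disposed, stream still open

    using (var writer = new StreamWriter(stream, Encoding.UTF8, bufferSize: 4096, leaveOpen: true))
    {
        writer.WriteLine("second block, same file");
    }
}   // stream closed here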

Related

Benefits of saving multiple files async

I'm writing an action on my controller which saves files to disk, on .NET Core 2.0.
I saw some code which saved files like this:
foreach (var formFile in files)
{
    if (formFile.Length > 0)
    {
        // filePath is the destination path computed for this particular file
        using (var stream = new FileStream(filePath, FileMode.Create))
        {
            await formFile.CopyToAsync(stream);
        }
    }
}
This is saving files async but sequentially. So I decided to write it a bit differently
var fileTasks = files.Where(f => f.Length > 0).Select(f => this.SaveFile(f, BASE_PATH));
await Task.WhenAll(fileTasks);

protected async Task SaveFile(IFormFile file, string basePath)
{
    var fileName = Path.GetTempFileName();
    var filePath = Path.Combine(basePath, fileName);
    using (var stream = new FileStream(filePath, FileMode.Create))
    {
        await file.CopyToAsync(stream);
    }
}
Assuming I'm saving them all to the same drive, would there be any benefit to doing this?
I'm aware I wouldn't be blocking any threads, but would there still be a bottleneck at the disk? Or can modern computers save more than one file at once?
would there still be a bottleneck at the disk? Or can modern computers save more than one file at once?
Yes, and yes. The disk, being orders of magnitude slower than the rest of the computer, will always be a bottle-neck. But, while it is not possible to literally write to more places on a disk at once than there are write heads (rotating media disks almost all have multiple write heads, because there are multiple platters and platter sides on almost all such disks), certainly modern computers (and even not-so-modern computers) can track the I/O for multiple files at once.
The short answer to the broader question: the only way to know for sure, with respect to any performance question, is to test it. No one here can predict what the outcome will be. This is true even for relatively simple CPU-bound problems, and it's even more significant when you're dealing with something as complex as writing data to a storage device.
And even if you find you can make the file I/O faster now, that effort may or may not remain relevant in the future. It's even possible you could wind up with your code being slower than a simpler implementation.
The longer version…
Issues that affect the actual performance include:
Type of drive. Conventional hard disks with rotating media are generally much slower than SSD, but each type of drive has its own particular performance characteristics.
Configuration of drive. Different manufacturers ship drives with different disk RPMs (for rotating drives), different controllers, different cache sizes and types, and varying support for disk protocols. A logical drive might actually be multiple physical drives (e.g. RAID), and even within a drive the storage can be configured differently: rotating media drives can have varying numbers of platters for a given amount of storage, and SSDs can use a variety of storage technologies and arrangements (i.e. single-level vs. multi-level cells, with different block sizes and layouts). This is far from an exhaustive list of the types of variations one might see in disk drives.
File system. Even Windows supports a wide range of file systems, and other OS's have an even broader variety of options. Each file system has specific things it's good at and poor at, and performance will depend on the exact nature of how the files are being accessed.
Driver software. Drives mostly use standardized APIs and typically a basic driver in the OS is used for all types of drives. But there are exceptions to the rule.
Operating system version and configuration. Different versions of Windows, or any other OS, have subtly different implementations for dealing with disk I/O. Even within a given version of an OS, a given drive may be configured differently, with options for caching.
Some generalizations can be made, but for every true generalization, there will be an exception. Murphy's Law leads us to conclude that if you ignore real-world testing of your implementation, you'll wind up being the exception.
All that said, it is possible that writing to multiple files concurrently can improve throughput, at least for disks with rotating media. Why?
While the comment above from @Plutonix is correct, it does gloss over the fact that the disk controller will optimize the writes as best it can. Having multiple writes queued at once (whether due to multiple files or a single file spread around the disk) allows the disk controller to take advantage of the current position of the disk.
Consider, for example, if you were to write a file one block at a time. You write a block; when you find it's been written, you write another. Well, the disk has moved by the time you get around to writing the next block, so now you have to wait for the proper location to come back around to the write head before the next write can complete.
So, what if you hand over two blocks to the OS at a time? Now, the disk controller can be told about both blocks, and if one block can be written immediately after another, it's there ready to be written. No waiting for another rotation of the disk.
The more blocks you can hand over at once, and the more the disk controller can see to write at once, the better the odds of it being able to write blocks continuously as the platter spins under the write head, without having to pause and wait for the right spot to come back around.
So, why not always write files this way? Well, the biggest reason is that we usually don't need to write data that fast. The user is not inconvenienced by file I/O taking 500 ms instead of 50.
Plus, it significantly increases the complexity of the code.
In addition, the programming frameworks, operating system, file system, and disk controller all have features that provide much or all of the same benefit, without the program itself having to work harder. Buffering at every level of disk I/O means that when your program writes to a file, it thinks the write went really fast, but all that happened was all that data got squirreled away by one or more layers in the disk I/O pipeline, allowing those layers to provide enough data to the disk at once for optimizations involving timing writes for platter position to be done transparently to your program.
Often — almost all the time, I'd guess — if your program is simply streaming data sequentially quickly enough, even without any concurrency the disk can still be kept at a high level of efficiency, because the buffers are large enough to ensure that for any writeable block that goes under the write head, there's a block of data ready to write to it.
Naturally, SSDs change the analysis significantly. Latency on the physical media is no longer an issue, but there are lots more different ways to build an SSD, and each will come with different performance characteristics. On top of that, the technology for SSDs is still changing quickly. The people who design and build SSDs, their controllers, and even the operating systems that use them, work hard to ensure that even naïve programs work efficiently.
So, in general, just write your code naïvely. It's a lot less work to do so, and in most cases it'll work just as well. If you do decide to measure performance, and find that you can make disk I/O work more efficiently by writing to multiple files asynchronously, plan on rechecking your results periodically over time. Changes to disk technology can easily render your optimizations null and void, or even counter-productive.
Related reading:
How to handle large numbers of concurrent disk write requests as efficiently as possible
outputing dictionary optimally
Performance creating multiple small files
What is the maximum number of simultaneous I/O operations in .net 4.5?

Having multiple simultaneous writers (no reader) to a single file. Is it possible to accomplish in a performant way in .NET?

I'm developing a multi-segment file downloader. To accomplish this task I currently create as many temporary files on disk as I have segments (the number is fixed while the file is downloading). At the end I just create a new file f and copy all the segments' contents into f.
I was wondering if there isn't a better way to accomplish this. My ideal would be to initially create f at its full size and then have the different threads write directly onto their own portion. There need not be any kind of interaction between them; we can assume each will start at its own starting point in the file and then only fill in information sequentially until its task is over.
I've heard about memory-mapped files (http://msdn.microsoft.com/en-us/library/dd997372(v=vs.110).aspx) and I'm wondering if they are the solution to my problem or not.
Thanks
Using the memory-mapped API is absolutely doable and it will probably perform quite well - of course some testing would be recommended.
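As a rough sketch of the memory-mapped approach (the method name, file path and segment layout here are hypothetical): pre-size the destination file, then let each segment write through its own non-overlapping view.
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Threading.Tasks;

// Pre-sizes the destination file, then writes each segment through its own
// view of the mapping; no locking is needed as long as the ranges don't overlap.
static void WriteSegments(string path, long totalLength, (long Offset, byte[] Data)[] segments)
{
    using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Create, null, totalLength))
    {
        Parallel.ForEach(segments, segment =>
        {
            using (var view = mmf.CreateViewStream(segment.Offset, segment.Data.Length))
            {
                view.Write(segment.Data, 0, segment.Data.Length);
            }
        });
    }
}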
If you want to look for a possible alternative implementation, I have the following suggestion.
Create a static stack data structure where the download threads can push each file segment as soon as it's downloaded.
Have a separate thread listen for push notifications on the stack. Pop the file segments off the stack and save each segment into the target file in a single-threaded way.
By following the above pattern, you separate the downloading of file segments from the saving into a regular file, by putting a stack container in between.
Depending on the implementation of the stack handling, you will be able to implement this with very little thread locking, which will maximise performance.
The pro of this is that you have 100% control over what is going on, and a solution that might be more portable (if that should ever be a concern).
The stack decoupling pattern can also be implemented pretty generically, and might even be reused in the future.
The implementation of this is not that complex, and is probably on par with the implementation needed around the memory-mapped API.
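A minimal sketch of that decoupling, using a BlockingCollection as the container instead of a literal stack (names such as target.bin are illustrative): download threads add segments, and a single writer thread drains them to the file.
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

// Download threads push (offset, data) pairs; one writer thread drains the
// collection and writes each segment at its offset in the target file.
var segments = new BlockingCollection<(long Offset, byte[] Data)>();

var writerTask = Task.Run(() =>
{
    using (var output = new FileStream("target.bin", FileMode.OpenOrCreate, FileAccess.Write))
    {
        foreach (var segment in segments.GetConsumingEnumerable())
        {
            output.Position = segment.Offset;
            output.Write(segment.Data, 0, segment.Data.Length);
        }
    }
});

// Producer side (inside each download thread):
//   segments.Add((segmentOffset, segmentBytes));
// When all downloads are finished:
//   segments.CompleteAdding();
//   writerTask.Wait();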
Have fun...
/Anders
The answers posted so far are, of course, addressing your question, but you should also consider the fact that multi-threaded I/O writes will most likely NOT give you any gains in performance.
The reason for multi-threading downloads is obvious and has dramatic results. When you try to combine the files, though, remember that you are having multiple threads manipulate a mechanical head on a conventional hard drive. In the case of SSDs you may get better performance.
If you use a single thread, you are by far exceeding the HDD's write capacity when writing SEQUENTIALLY, and that IS by definition the fastest way to write to conventional disks.
If you believe otherwise, I would be interested to know why. I would rather concentrate on tweaking the write performance of a single thread by playing around with buffer sizes, etc.
Yes, it is possible, but the only precaution you need to take is to ensure that no two threads write to the same location in the file, otherwise the file content will be incorrect.
// FileShare.Write allows other writers to have the file open at the same time
FileStream writeStream = new FileStream(destinationPath, FileMode.OpenOrCreate, FileAccess.Write, FileShare.Write);
writeStream.Position = startPositionOfSegments; // REMEMBER: this piece of calculation is important
// A simple write of the bytes ... just read from your source and then write
writeStream.Write(ReadBytes, 0, bytesReadFromInputStream);
After each Write we used writeStream.Flush() so that buffered data gets written to the file, but you can change that according to your requirements.
Since you already have working code which downloads the file segments in parallel, the only change you need to make is to open the file stream as posted above, and instead of creating many segment files locally, just open a stream to a single file.
The startPositionOfSegments is very important; calculate it exactly so that no two segments overwrite the downloaded bytes at the same location in the file, otherwise you will get an incorrect result.
The above procedure works perfectly fine at our end, but it can be a problem if your segment sizes are too small (we faced this too, but it was fixed after increasing the size of the segments). If you face any exception, you can also synchronize only the Write part.

How ReadLine works in .NET

Let's say I have a 1 GB text file and I want to read it. If I try to open the whole file, I get a memory overflow error. I know the usual answer is "use the StreamReader.ReadLine() method", but I am wondering how this works. If the program which uses the ReadLine method wants to get a line, it will have to open the entire text file sooner or later. As far as I know, files are stored on disk and they are opened into memory on an "all or nothing" principle. If only one line of my 1 GB text file is kept in memory at a time by the ReadLine() method, that means we have to do disk I/O for every line of my 1 GB text file while reading it. Isn't this a terrible thing to do for performance?
I'm so confused and I want some details about this.
that means we have to do disk I/O for every line of my 1 GB text file
No, there are lots of layers between your ReadLine() call and the physical disk, designed to not make this a problem. The ones that matter most:
FileStream, the underlying class that does the job for StreamReader, uses a buffer to reduce the number of ReadFile() calls. Default size is 4096 bytes
ReadFile() reads file data from the file system cache, not the disk. That may result in a call to the disk driver, but that's not so common. The operating system is smart enough to guess that you are likely to read more data from the file and pre-reads it from the disk as long as that is cheap to do and RAM isn't being used for anything else. It typically slurps an entire disk cylinder worth of data.
The disk drive itself has a cache as well, usually several megabytes.
The file system cache is by far the most important one. It is also a tricky one, because it stops you from accurately profiling your program. When you run your test over and over again, your program in fact never reads from the disk, only from the cache, which makes it unrealistically fast. Albeit that a 1 GB file might not quite fit; it depends on how much RAM you have in the machine.
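For illustration, a minimal sketch of buffered line-by-line reading; only the current line plus FileStream's buffer (4096 bytes is its default, made explicit here) sits in memory at any time. The file name is hypothetical.
using System.IO;

// Only one line (plus FileStream's internal buffer) is in memory at a time.
using (var reader = new StreamReader(
    new FileStream("huge.txt", FileMode.Open, FileAccess.Read, FileShare.Read, bufferSize: 4096)))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // process one line at a time
    }
}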
Usually behind the scenes a FileStream object is opened which reads a large block of your file from disk and pulls it into memory. This block acts as a cache for ReadLine() to read from, so you don't have to worry about each ReadLine() causing a disk access.
A terrible thing for the performance of what?
Obviously it should be faster, given you have the memory available to hold the whole file in memory.
Finding and allocating a contiguous block is a cost, though.
A gig is a significant block of RAM; if your process has it, what's hurting?
Swapping could easily hurt more than streaming.
Do you need all the file at once? Do you need it all the time?
If you went to read/write, what would that do to you?
What if the file grew to 2 GB?
You can optimise for one factor. Before you do, you've got to make sure it's the right one, and above all you have to remember this is a real machine. You have a finite amount of resources, so optimisation is always robbing Peter to pay Paul. Peter might get upset...

How to improve read and write speed or performance for a large number of small files

Yesterday I asked a question here: how to disable the disk cache in C# by invoking the Win32 CreateFile API with FILE_FLAG_NO_BUFFERING.
My performance test (a write and read test with 1,000 files, 220 MB total) showed that FILE_FLAG_NO_BUFFERING didn't help me improve performance; it was slower than the .NET default disk cache. Changing FILE_FLAG_NO_BUFFERING to FILE_FLAG_SEQUENTIAL_SCAN got back to the .NET default disk cache performance and was a little faster.
Before this, I tried using MongoDB's GridFS feature to replace the Windows file system, which wasn't good (and I don't need the distributed feature, I was just experimenting).
In my product, the server receives a lot of smaller files (60-100 KB) per second over TCP/IP and then needs to save them to disk, and a third service reads these files once (just reads once and processes). Would asynchronous I/O help me get the best speed with the lowest CPU usage? Can someone give me a suggestion, or can I still use the FileStream class?
Update 1
Could a memory-mapped file meet my needs, i.e. writing all the files into one big file (or a few) and reading them back from it?
If your PC is taking 5-10 seconds to write a 100kB file to disk, then you either have the world's oldest, slowest PC, or your code is doing something very inefficient.
Turning off disk caching will probably make things worse rather than better. With a disk cache in place, your writes will be fast, and Windows will do the slow part of flushing the data to disk later. Indeed, increasing I/O buffering usually results in significantly improved I/O in general.
You definitely want to use asynchronous writes - that means your server starts the data writing, and then goes back to responding to its clients while the OS deals with writing the data to disk in the background.
There shouldn't be any need to queue the writes (as the OS will already be doing that if disk caching is enabled), but that is something you could try if all else fails - it could potentially help by writing only one file at a time to minimise the need for disk seeks.
Generally for I/O, using larger buffers helps to increase your throughput. For example instead of writing each individual byte to the file in a loop, write a buffer-ful of data (ideally the entire file, for the sizes you mentioned) in one Write operation. This will minimise the overhead (instead of calling a write function for every byte, you call a function once for the entire file). I suspect you may be doing something like this, as it's the only way I know to reduce performance to the levels you've suggested you are getting.
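As a sketch of combining both points (asynchronous write, whole file in one buffer), assuming the file's bytes are already in memory; the method name and buffer size are illustrative.
using System.IO;
using System.Threading.Tasks;

// Writes the whole file in one asynchronous call; the server thread is free
// to go back to handling clients while the OS flushes the data in the background.
static async Task SaveFileAsync(string path, byte[] data)
{
    using (var fs = new FileStream(path, FileMode.Create, FileAccess.Write,
                                   FileShare.None, bufferSize: 64 * 1024, useAsync: true))
    {
        await fs.WriteAsync(data, 0, data.Length);
    }
}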
Memory-mapped files will not help you. They're really best for accessing the contents of huge files.
One of the biggest and most significant improvements in your case, imo, would be to process the files without saving them to disk first; then, if you really need to store them, push them onto a queue and process it in another thread that saves them to disk. By doing this you immediately get the processed data you need, without losing time saving the data to disk, yet you still have a file on disk afterwards, without wasting the computational power of your file processor.

Efficient log backup program in C#

I am writing a log backup program in C#. The main objective is to take logs from multiple servers, copy and compress the files and then move them to a central data storage server. I will have to move about 270 GB of data every 24 hours. I have a dedicated server to run this job and a 1 Gbps LAN. Currently I am reading lines from a (text) file, copying them into a buffer stream and writing them to the destination.
My last test copied about 2.5 GB of data in 28 minutes. This will not do. I will probably thread the program for efficiency, but I am looking for a better method to copy the files.
I was also playing with the idea of compressing everything first and then using a buffered stream to copy. Really, I am just looking for a little advice from someone with more experience than me.
Any help is appreciated, thanks.
You first need to profile as Umair said so that you can figure out how much of the 28 minutes is spent compressing vs. transmitting. Also measure the compression rate (bytes/sec) with different compression libraries, and compare your transfer rate against other programs such as Filezilla to see if you're close to your system's maximum bandwidth.
One good library to consider is DotNetZip, which allows you to zip to a stream, which can be handy for large files.
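For illustration, a minimal compress-while-copying sketch using the framework's GZipStream (DotNetZip's stream-based API follows the same general shape); the method name and paths are hypothetical.
using System.IO;
using System.IO.Compression;

// Compresses a log file while streaming it to the destination (e.g. a UNC
// path on the storage server), so compression and transfer happen in one pass.
static void CompressTo(string sourcePath, string destinationPath)
{
    using (var source = File.OpenRead(sourcePath))
    using (var destination = File.Create(destinationPath))
    using (var gzip = new GZipStream(destination, CompressionMode.Compress))
    {
        source.CopyTo(gzip);
    }
}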
Once you get it fine-tuned for one thread, experiment with several threads and watch your processor utilization to see where the sweet spot is.
One solution is what you mentioned: compress the files into one zip file and then transfer it over the network. This will be much faster, as you are transferring one file, and one of the principal bottlenecks during file transfers is often the destination's security checks.
So if you use one zip file, there should be only one check.
In short:
Compress
Transfer
Decompress (if you need to)
This alone should bring you big benefits in terms of performance.
Compress the logs at source and use TransmitFile (that's a native API - not sure if there's a framework equivalent, or how easy it is to P/Invoke this) to send them to the destination. (Possibly HttpResponse.TransmitFile does the same in .Net?)
In any event, do not read your files line by line. Read the files in blocks (loop doing FileStream.Read for, say, 4K bytes until the read count == 0) and send that directly to the network pipe.
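A minimal sketch of that block-wise copy, with a hypothetical destination stream (it could be a NetworkStream or an output FileStream):
using System.IO;

// Reads the file in fixed-size blocks and pushes each block straight to the
// destination stream, instead of going line by line.
static void CopyInBlocks(string sourcePath, Stream destination, int blockSize = 4096)
{
    byte[] buffer = new byte[blockSize];
    using (var source = new FileStream(sourcePath, FileMode.Open, FileAccess.Read))
    {
        int bytesRead;
        while ((bytesRead = source.Read(buffer, 0, buffer.Length)) > 0)
        {
            destination.Write(buffer, 0, bytesRead);
        }
    }
}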
Try profiling your program... the bottleneck is often where you least expect it to be. As some clever guy said, "Premature optimisation is the root of all evil."
Once, in a similar scenario at work, I was given the task of optimising the process. After profiling, the bottleneck was found to be a call to a sleep function (which was used for synchronisation between threads!).
