Benefits of saving multiple files async - c#

I'm writing an action on my controller which saves files to disk. On .Net Core 2.0
I saw some code which saved files like this.
foreach (var formFile in files)
{
    if (formFile.Length > 0)
    {
        using (var stream = new FileStream(filePath, FileMode.Create))
        {
            await formFile.CopyToAsync(stream);
        }
    }
}
This is saving files async but sequentially. So I decided to write it a bit differently
var fileTasks = files.Where(f => f.Length > 0).Select(f => this.SaveFile(f, BASE_PATH));
await Task.WhenAll(fileTasks);

protected async Task SaveFile(IFormFile file, string basePath)
{
    // Path.GetRandomFileName returns just a name; Path.GetTempFileName would
    // create a file in the temp directory and return its full path instead.
    var fileName = Path.GetRandomFileName();
    var filePath = Path.Combine(basePath, fileName);
    using (var stream = new FileStream(filePath, FileMode.Create))
    {
        await file.CopyToAsync(stream);
    }
}
Assuming I'm saving them all to the same drive, would there be any benefit of doing this?
I'm aware I wouldn't be blocking any threads, but would there still be a bottleneck at the disk? Or can modern computers save more than one file at once?

would there still be a bottleneck at the disk? Or can modern computers save more than one file at once?
Yes, and yes. The disk, being orders of magnitude slower than the rest of the computer, will always be a bottleneck. And while it is not possible to literally write to more places on a disk at once than there are write heads (rotating-media disks almost all have multiple write heads, because there are multiple platters and platter sides), certainly modern computers (and even not-so-modern computers) can track the I/O for multiple files at once.
The short answer to the broader question: the only way to know for sure, with respect to any performance question, is to test it. No one here can predict what the outcome will be. This is true even for relatively simple CPU-bound problems, and it's even more significant when you're dealing with something as complex as writing data to a storage device.
And even if you find you can make the file I/O faster now, that effort may or may not remain relevant in the future. It's even possible you could wind up with your code being slower than a simpler implementation.
The longer version…
Issues that affect the actual performance include:
Type of drive. Conventional hard disks with rotating media are generally much slower than SSD, but each type of drive has its own particular performance characteristics.
Configuration of drive. Different manufacturers ship drives with different disk RPMs (for rotating drives), different controllers, different cache sizes and types, and varying support for disk protocols. A logical drive might actually be multiple physical drives (e.g. RAID), and even within a drive the storage can be configured differently: rotating-media drives can have varying numbers of platters for a given amount of storage, and SSDs can use a variety of storage technologies and arrangements (e.g. single-level vs. multi-level cells, with different block sizes and layouts). This is far from an exhaustive list of the types of variations one might see in disk drives.
File system. Even Windows supports a wide range of file systems, and other OS's have an even broader variety of options. Each file system has specific things it's good at and poor at, and performance will depend on the exact nature of how the files are being accessed.
Driver software. Drives mostly use standardized APIs and typically a basic driver in the OS is used for all types of drives. But there are exceptions to the rule.
Operating system version and configuration. Different versions of Windows, or any other OS, have subtly different implementations for dealing with disk I/O. Even within a given version of an OS, a given drive may be configured differently, with options for caching.
Some generalizations can be made, but for every true generalization, there will be an exception. Murphy's Law leads us to conclude that if you ignore real-world testing of your implementation, you'll wind up being the exception.
All that said, it is possible that writing to multiple files concurrently can improve throughput, at least for disks with rotating media. Why?
While the comment above from @Plutonix is correct, it does gloss over the fact that the disk controller will optimize the writes as best it can. Having multiple writes queued at once (whether due to multiple files or a single file spread around the disk) allows the disk controller to take advantage of the current position of the disk.
Consider, for example, writing a file one block at a time: you write a block, wait until it's been written, then write another. By the time you issue the next write, the disk has moved on, so now you have to wait for the proper location to come back around to the write head before that write can complete.
So, what if you hand over two blocks to the OS at a time? Now, the disk controller can be told about both blocks, and if one block can be written immediately after another, it's there ready to be written. No waiting for another rotation of the disk.
The more blocks you can hand over at once, and the more the disk controller can see to write at once, the better the odds of it being able to write blocks continuously as the platter spins under the write head, without having to pause and wait for the right spot to come back around.
So, why not always write files this way? Well, the biggest reason is that we usually don't need to write data that fast. The user is not inconvenienced by file I/O taking 500 ms instead of 50.
Plus, it significantly increases the complexity of the code.
In addition, the programming frameworks, operating system, file system, and disk controller all have features that provide much or all of the same benefit, without the program itself having to work harder. Buffering at every level of disk I/O means that when your program writes to a file, the write appears to complete very quickly; in reality the data has been squirreled away by one or more layers of the disk I/O pipeline. Those layers can then present enough data to the disk at once that optimizations such as timing writes for platter position happen transparently to your program.
Often — almost all the time, I'd guess — if your program is simply streaming data sequentially quickly enough, even without any concurrency the disk can still be kept at a high level of efficiency, because the buffers are large enough to ensure that for any writeable block that goes under the write head, there's a block of data ready to write to it.
Naturally, SSDs change the analysis significantly. Latency on the physical media is no longer an issue, but there are lots more different ways to build an SSD, and each will come with different performance characteristics. On top of that, the technology for SSDs is still changing quickly. The people who design and build SSDs, their controllers, and even the operating systems that use them, work hard to ensure that even naïve programs work efficiently.
So, in general, just write your code naïvely. It's a lot less work to do so, and in most cases it'll work just as well. If you do decide to measure performance, and find that you can make disk I/O work more efficiently by writing to multiple files asynchronously, plan on rechecking your results periodically over time. Changes to disk technology can easily render your optimizations null and void, or even counter-productive.
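If you do decide to measure, a minimal sketch along these lines could serve as a starting point. The file count, file size, and temp-directory location here are arbitrary choices for illustration, not part of the original question:

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

// Write the same payload sequentially and concurrently, and compare timings.
var payload = new byte[1024 * 1024]; // 1 MB per file; tune to your workload
var dir = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
Directory.CreateDirectory(dir);

async Task WriteFileAsync(string path)
{
    using (var stream = new FileStream(path, FileMode.Create, FileAccess.Write,
                                       FileShare.None, bufferSize: 4096, useAsync: true))
    {
        await stream.WriteAsync(payload, 0, payload.Length);
    }
}

// Sequential: one file at a time.
var sw = Stopwatch.StartNew();
for (int i = 0; i < 8; i++)
    await WriteFileAsync(Path.Combine(dir, $"seq{i}.bin"));
var sequential = sw.Elapsed;

// Concurrent: all eight writes in flight at once.
sw.Restart();
await Task.WhenAll(Enumerable.Range(0, 8)
    .Select(i => WriteFileAsync(Path.Combine(dir, $"par{i}.bin"))));
var concurrent = sw.Elapsed;

Console.WriteLine($"sequential: {sequential}, concurrent: {concurrent}");
```

As the answer says, the numbers you get will depend entirely on your drive, file system, and OS, so rerun this on the hardware that matters.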
Related reading:
How to handle large numbers of concurrent disk write requests as efficiently as possible
outputing dictionary optimally
Performance creating multiple small files
What is the maximum number of simultaneous I/O operations in .net 4.5?

Related

C# Reuse StreamWriter or FileStream but change destination file

A little background...
Everything I'm about to describe, up to my implementation of the StreamWriter, is a business process which I cannot change.
Every month I pull around 200 different tables of data into individual files.
Each file contains roughly 400,000 lines of business logic details for upwards of 5,000-6,000 different business units.
To effectively use that data with the tools on hand, I have to break down those files into individual files for each business unit...
200 files x 5000 business units per file = 100,000 different files.
The way I've BEEN doing it is the typical StreamWriter loop...
foreach (string SplitFile in BusinessFiles)   // the list of output file paths
{
    using (StreamWriter SW = new StreamWriter(SplitFile))
    {
        foreach (var BL in g)                 // the lines for this business unit
        {
            string[] Split1 = BL.Split(',');
            SW.WriteLine("{0,-8}{1,-8}{2,-8}{3,-8}{4,-8}{5,-8}{6,-8}{7,-8}{8,-8}{9,-16}{10,-1}",
                Split1[0], Split1[1], Split1[2], Split1[3], Split1[4], Split1[5],
                Split1[6], Split1[7], Split1[8],
                Convert.ToDateTime(Split1[9]).ToString("dd-MMM-yyyy"), Split1[10]);
        }
    }
}
The issue with this is that it takes an excessive amount of time.
It can sometimes take 20 minutes to process all the files.
Profiling my code shows me that 98% of the time spent is on the system disposing of the StreamWriter after the program leaves the loop.
So my question is......
Is there a way to keep the underlying Stream open and reuse it to write a different file?
I know I can Flush() the Stream but I can't figure out how to get it to start writing to another file altogether. I can't seem to find a way to change the destination filename without having to call another StreamWriter.
Edit:
A picture of what it shows when I profile the code
Edit 2:
So after poking around a bit more I started looking at it a different way.
First thing is, I already had the reading of the one file and the writing of the massive number of smaller files in a nested parallel loop, so I was essentially maxing out my I/O as is.
I'm also writing to an SSD, so all those were good points.
Turns out I'm reading the 1 massive file and writing ~5600 smaller ones every 90 seconds or so.
That's 60 files a second. I guess I can't really ask for much more than that.
This sounds about right. 100,000 files in 20 minutes is more than 83 files every second. Disk I/O is pretty much the slowest thing you can do within a single computer. All that time in the Dispose() method is waiting for the buffer to flush out to disk while closing the file... it's the actual time writing the data to your persistent storage, and a separate using block for each file is the right way to make sure this is done safely.
To speed this up it's tempting to look at asynchronous processing (async/await), but I don't think you'll find any gains there; ultimately this is an I/O-bound task, so optimizing for your CPU scheduling might even make things worse. Better gains could be available if you can change the output to write into a single (indexed) file, so the operating system's disk buffering mechanism can be more efficient.
I would agree with Joel that the time is mostly due to writing the data out to disk. I would however be a little bit more optimistic about doing parallel IO, since SSDs are better able to handle higher loads than regular HDDs. So I would try a few things:
1. Doing stuff in parallel
Change your outer loop to a parallel one
Parallel.ForEach(
    myBusinessFiles,
    new ParallelOptions { MaxDegreeOfParallelism = 2 },
    SplitFile =>
    {
        // Loop body
    });
Try changing the degree of parallelism to see if performance improves or not. This assumes the data is thread safe.
2. Try writing to a high-speed local SSD
I'm assuming you are writing to a network folder, which adds some latency, so you might try writing to a local disk. If you are already doing that, consider getting a faster disk. If you need to move all the files to a network drive afterwards, you will likely not gain anything, but it can give you an idea of the penalty you pay for the network.
3. Try writing to a Zip Archive
A zip archive can contain multiple files while still allowing fairly easy access to an individual file. This could help improve performance in a few ways:
Compression. I would assume your data is fairly easy to compress, so you would write less data overall.
Less file system operations. Since you are only writing to a single file you would avoid some overhead with the file system.
Reduced overhead due to cluster size. Files occupy a minimum allocation on disk, which can waste a fair bit of space for small files. Using an archive avoids this.
You could also try saving each file in an individual zip-archive, but then you would mostly benefit from the compression.
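As a rough sketch of the single-archive idea using the framework's ZipArchive support (entry names and contents here are placeholders, not the asker's real data):

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Write many small "files" as entries of one zip archive instead of as
// 100,000 separate files on disk.
string zipPath = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName() + ".zip");

using (var archive = ZipFile.Open(zipPath, ZipArchiveMode.Create))
{
    for (int unit = 0; unit < 3; unit++)
    {
        var entry = archive.CreateEntry($"unit{unit}.txt", CompressionLevel.Fastest);
        using (var writer = new StreamWriter(entry.Open()))
            writer.WriteLine($"details for business unit {unit}");
    }
}

// Individual entries remain addressable later.
using (var read = ZipFile.OpenRead(zipPath))
    Console.WriteLine(read.Entries.Count); // 3
```

The archive is a single open file handle for the whole run, which avoids the per-file open/flush/close cost that dominated the profile.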
To answer your literal question: there is an option (a flag on the constructor) that leaves the underlying stream open when the writer is disposed. Be aware, though, that the stream is still bound to a single file, and sharing it in a multi-threaded environment could be a mess. That said, this is the overloaded constructor:
StreamWriter(Stream, Encoding, Int32, Boolean)
Initializes a new instance of the StreamWriter class for the specified stream by using the specified encoding and buffer size, and optionally leaves the stream open.
public StreamWriter (System.IO.Stream stream, System.Text.Encoding? encoding = default, int bufferSize = -1, bool leaveOpen = false);
Source
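A small illustration of what leaveOpen does. A MemoryStream stands in for the real stream here; the behavior is the same for a FileStream:

```csharp
using System;
using System.IO;
using System.Text;

// leaveOpen: true means disposing the StreamWriter flushes it but does not
// close the underlying stream, so the stream can be handed to another writer.
var stream = new MemoryStream(); // stands in for a FileStream

using (var writer = new StreamWriter(stream, Encoding.UTF8, bufferSize: 1024, leaveOpen: true))
    writer.Write("first writer");

// The stream is still open here; without leaveOpen, constructing a writer on
// the now-disposed stream would throw.
using (var writer = new StreamWriter(stream, Encoding.UTF8, bufferSize: 1024, leaveOpen: true))
    writer.Write(", second writer");

Console.WriteLine(stream.Length > 0); // True
```

Note that this only keeps one stream alive; a FileStream is still tied to one file, so it doesn't let a single writer retarget 100,000 different destination files.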

Are there any persistence guarantees when using memory mapped files or plain Stream.Write

I have lots of data which I would like to save to disk in binary form and I would like to get as close to having ACID properties as possible. Since I have lots of data and cannot keep it all in memory, I understand I have two basic approaches:
Have lots of small files (e.g. write to disk every minute or so) - in case of a crash I lose only the last file. Performance will be worse, however.
Have a large file (e.g. open, modify, close) - best sequential read performance afterwards, but in case of a crash I can end up with a corrupted file.
So my question is specifically:
If I choose to go for the large file option and open it as a memory mapped file (or using Stream.Position and Stream.Write), and there is a loss of power, are there any guarantees to what could possibly happen with the file?
Is it possible to lose the entire large file, or just end up with the data corrupted in the middle?
Does NTFS ensure that a block of certain size (4k?) always gets written entirely?
Is the outcome better/worse on Unix/ext4?
I would like to avoid using NTFS TxF since Microsoft already mentioned it's planning to retire it. I am using C# but the language probably doesn't matter.
(additional clarification)
It seems that there should be a certain guarantee, because -- unless I am wrong -- if it were possible to lose the entire file (or suffer really weird corruption) while writing to it, then no existing DB would be ACID, unless they 1) use TxF or 2) make a copy of the entire file before writing? I don't think a journal will help you if you lose parts of the file you didn't even plan to touch.
You can call FlushViewOfFile, which initiates dirty page writes, and then FlushFileBuffers, which according to this article, guarantees that the pages have been written.
Calling FlushFileBuffers after each write might be "safer" but it's not recommended. You have to know how much loss you can tolerate. There are patterns that limit that potential loss, and even the best databases can suffer a write failure. You just have to come back to life with the least possible loss, which typically demands some logging with a multi-phase commit.
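In managed code, the closest equivalent is FileStream.Flush(true), which asks the OS to flush its buffers through to the device (FlushFileBuffers on Windows). A minimal sketch, with an illustrative temp file:

```csharp
using System;
using System.IO;
using System.Text;

// FileStream.Flush(true) forces the OS to flush its buffers to the device
// (FlushFileBuffers on Windows), creating a durable commit point.
string path = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());

using (var stream = new FileStream(path, FileMode.Create, FileAccess.Write))
{
    byte[] record = Encoding.UTF8.GetBytes("committed record\n");
    stream.Write(record, 0, record.Length);
    stream.Flush(flushToDisk: true); // expensive; call only at commit points
}

Console.WriteLine(new FileInfo(path).Length); // 17
```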
I suppose it's possible to open the memory mapped file with FILE_FLAG_NO_BUFFERING and FILE_FLAG_WRITE_THROUGH, but that's going to eat your throughput. I don't do this. I open the memory mapped files for asynchronous I/O, letting the OS optimize the throughput with its own implementation of async I/O completion ports. It's the fastest possible throughput. I can tolerate potential loss, and have mitigated appropriately. My memory mapped data is file backup data... and if I detect loss, I can detect and re-backup the lost data once the hardware error is cleared.
Obviously, the file system has to be reliable enough to operate a database application, but I don't know of any vendors that suggest you don't still need backups. Bad things will happen. Plan for loss. One thing I do is that I never write into the middle of data. My data is immutable and versioned, and each "data" file is limited to 2gb, but each application employs different strategies.
The NTFS file system (and ext3/ext4) uses a transaction journal to apply changes. Each change is stored in the journal, and the journal itself is then used to actually perform the change.
Except for catastrophic disk failures, the file system is designed to be consistent in its own data structures, not yours: in case of a crash, the recovery procedure will decide what to roll back in order to preserve the consistency. In case of roll back, your "not-yet-written but to-be-written" data is lost.
The file system will be consistent, while your data may not be.
Additionally, several other factors are involved: software and hardware caches introduce an additional layer, and therefore a point of failure. Usually operations are performed in the cache, and then the cache itself is flushed to disk. The file system driver doesn't see the operations performed "in" the cache, only the flush operations.
This is done for performance reasons, as the hard drive is the bottleneck. Hardware controllers do have batteries to guarantee that their own cache can be flushed even in the event of power loss.
The size of a sector is another important factor, but this detail should not be relied upon, as the hard drive itself could lie about its native size for interoperability purposes.
If you have a memory-mapped file and you are inserting data in the middle when the power goes down, the content of the file might contain only part of the change you made if it exceeds the size of the internal buffers.
TxF is a way to mitigate the issue, but it has several implications that limit the contexts in which you can use it: for example, it does not work across different drives or on network shares.
In order to be ACID, you need to design your data structures and/or the way you use them so as not to rely on implementation details. For example, Mercurial (a versioning tool) always appends its own data to its own revision log.
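A toy version of that append-only pattern (the record format and flush policy here are illustrative, not Mercurial's actual format):

```csharp
using System;
using System.IO;
using System.Text;

// Append-only sketch: records are only ever added at the end of the file, so
// a crash can at worst truncate the unfinished tail; earlier records are
// never rewritten and therefore never corrupted in place.
string log = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());

void AppendRecord(string record)
{
    using (var stream = new FileStream(log, FileMode.Append, FileAccess.Write))
    {
        byte[] bytes = Encoding.UTF8.GetBytes(record + "\n");
        stream.Write(bytes, 0, bytes.Length);
        stream.Flush(flushToDisk: true); // commit before acknowledging the record
    }
}

AppendRecord("rev1");
AppendRecord("rev2");
Console.WriteLine(File.ReadAllLines(log).Length); // 2
```

On recovery you would scan from the end and discard any incomplete trailing record, which is far simpler than repairing an in-place overwrite.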
There are many possible patterns; however, the more guarantees you need, the more technology-specific you'll get (and be tied to).

Parallel Concurrent Binary Readers

I have a Parallel.ForEach loop creating binary readers on the same group of large data files.
I was just wondering if it hurts performance that these readers are reading the same files in parallel (i.e., would it go faster if they were reading exclusively different files?).
I am asking because there is a lot of disk I/O involved (I guess...).
Edit: I forgot to mention: I am using an Amazon EC2 instance and the data is on the C:\ disk assigned to it. I have no idea how that affects this issue.
Edit 2: I'll make measurements duplicating the data folder and reading from 2 different sources and see what it gives.
It's not a good idea to read from the same disk using multiple threads. Since the disk's mechanical head needs to move to seek the next reading location, multiple threads basically bounce it around, hurting performance.
The best approach is actually to read the files sequentially using a single thread and then handing the chunks to a group of threads to process them in parallel.
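One possible shape for that pattern, sketched with a BlockingCollection. The chunk sizes and worker count are arbitrary, and the "reader" loop stands in for sequential reads of the real file:

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// One "reader" feeds chunks into a bounded queue; several workers process
// them in parallel. The chunks here are stand-ins for reads from one file.
var chunks = new BlockingCollection<byte[]>(boundedCapacity: 8);
int processedBytes = 0;

Task[] workers = Enumerable.Range(0, 4).Select(_ => Task.Run(() =>
{
    foreach (byte[] chunk in chunks.GetConsumingEnumerable())
        Interlocked.Add(ref processedBytes, chunk.Length); // stand-in for real work
})).ToArray();

// Single sequential reader: in real code this loop would call stream.Read.
for (int i = 0; i < 100; i++)
    chunks.Add(new byte[1024]);
chunks.CompleteAdding();

Task.WaitAll(workers);
Console.WriteLine(processedBytes); // 102400
```

The bounded capacity keeps the reader from racing far ahead of the workers, so memory use stays flat while the disk head still moves sequentially.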
It depends on where your files are. If you're using one mechanical hard-disk, then no - don't read files in parallel, it's going to hurt performance. You may have other configurations, though:
On a single SSD, reading files in parallel will probably not hurt performance, but I don't expect you'll gain anything either.
On two mirrored disks using RAID 1 and a half-decent RAID controller, you can read two files at once and gain considerable performance.
If your files are stored on a SAN, you can most definitely read a few at a time and improve performance.
You'll have to try it, but you have to be careful with this - if the files aren't large enough, the OS caching mechanisms are going to affect your measurements, and the second test run is going to be really fast.

how to improve a large number of smaller files read and write speed or performance

Yesterday I asked a question here: how to disable the disk cache in C# by invoking the Win32 CreateFile API with FILE_FLAG_NO_BUFFERING.
My performance test (write and read, 1,000 files, 220 MB total) showed that FILE_FLAG_NO_BUFFERING didn't help me improve performance; it was slower than the .NET default disk cache. Changing FILE_FLAG_NO_BUFFERING to FILE_FLAG_SEQUENTIAL_SCAN matched the .NET default cache and was a little faster.
Earlier I tried using MongoDB's GridFS feature to replace the Windows file system; it wasn't good (and I don't need the distributed feature, it was just a taste).
In my product, the server receives many small files (60-100 kB) per second over TCP/IP and then needs to save them to disk, and a third service reads these files once (just one read, then processing). Would asynchronous I/O help me get the best speed and the lowest CPU usage? Can someone give me a suggestion? Or can I still use the FileStream class?
Update 1:
Could a memory-mapped file meet my need, i.e. writing all the files into one big file (or several) and reading from them?
If your PC is taking 5-10 seconds to write a 100kB file to disk, then you either have the world's oldest, slowest PC, or your code is doing something very inefficient.
Turning off disk caching will probably make things worse rather than better. With a disk cache in place, your writes will be fast, and Windows will do the slow part of flushing the data to disk later. Indeed, increasing I/O buffering usually results in significantly improved I/O in general.
You definitely want to use asynchronous writes - that means your server starts the data writing, and then goes back to responding to its clients while the OS deals with writing the data to disk in the background.
There shouldn't be any need to queue the writes (the OS will already be doing that if disk caching is enabled), but that is something you could try if all else fails - it could potentially help by writing only one file at a time to minimise the need for disk seeks.
Generally for I/O, using larger buffers helps to increase your throughput. For example, instead of writing each individual byte to the file in a loop, write a buffer-ful of data (ideally the entire file, for the sizes you mentioned) in one Write operation. This minimises the overhead: instead of calling a write function for every byte, you call it once for the entire file. I suspect you may be doing something like the byte-by-byte version, as it's the only way I know of to reduce performance to the levels you've suggested you are getting.
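For illustration, a one-call write of a 100 kB payload (the size chosen to match the files in the question; the path is a throwaway temp file):

```csharp
using System;
using System.IO;

// One Write call for the whole 100 kB payload, instead of 102,400 one-byte
// writes in a loop; the per-call overhead disappears.
string path = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
byte[] payload = new byte[100 * 1024];

using (var stream = new FileStream(path, FileMode.Create, FileAccess.Write))
{
    stream.Write(payload, 0, payload.Length);
}

Console.WriteLine(new FileInfo(path).Length); // 102400
```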
Memory-mapped files will not help you. They're really best for accessing the contents of huge files.
One of the biggest and most significant improvements in your case, IMO, would be to process the files without saving them to disk first; if you really do need to store them, push them onto a queue and have another thread drain it, saving them to disk. That way you immediately get the processed data you need, without losing time writing to disk, yet you still end up with the file on disk afterwards, without stealing computational power from your file processor.
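That receive-process-queue-write idea might be sketched with a Channel as the queue (file names, sizes, and the output directory are illustrative):

```csharp
using System;
using System.IO;
using System.Threading.Channels;
using System.Threading.Tasks;

// Incoming files go onto a queue; a background task drains the queue and
// persists them, so the receiving/processing path never waits on the disk.
var queue = Channel.CreateUnbounded<(string Name, byte[] Data)>();
string dir = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
Directory.CreateDirectory(dir);

Task writerTask = Task.Run(async () =>
{
    await foreach (var (name, data) in queue.Reader.ReadAllAsync())
        await File.WriteAllBytesAsync(Path.Combine(dir, name), data);
});

// "Receive" three files: process each immediately, then queue its disk write.
for (int i = 0; i < 3; i++)
{
    var file = (Name: $"file{i}.bin", Data: new byte[70 * 1024]);
    // ... process file.Data here, without waiting on the disk ...
    await queue.Writer.WriteAsync(file);
}
queue.Writer.Complete();
await writerTask;

Console.WriteLine(Directory.GetFiles(dir).Length); // 3
```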

What's the best way to read and parse a large text file over the network?

I have a problem which requires me to parse several log files from a remote machine.
There are a few complications:
1) The file may be in use
2) The files can be quite large (100mb+)
3) Each entry may be multi-line
To solve the in-use issue, I need to copy it first. I'm currently copying it directly from the remote machine to the local machine, and parsing it there. That leads to issue 2. Since the files are quite large, copying them locally can take quite a while.
To enhance parsing time, I'd like to make the parser multi-threaded, but that makes dealing with multi-lined entries a bit trickier.
The two main issues are:
1) How do I speed up the file transfer? (Compression? Is transferring it locally even necessary? Can I read an in-use file some other way?)
2) How do I deal with multi-line entries when splitting up the lines among threads?
UPDATE: The reason I didn't do the obvious thing and parse on the server is that I want to have as little CPU impact as possible. I don't want to affect the performance of the system I'm testing.
If you are reading a sequential file, you want to read it line by line over the network. You need a transfer method capable of streaming. You'll need to review your I/O streaming technology to figure this out.
Large IO operations like this won't benefit much by multithreading since you can probably process the items as fast as you can read them over the network.
Your other great option is to put the log parser on the server, and download the results.
The better option, from the perspective of performance, is going to be to perform your parsing at the remote server. Apart from exceptional circumstances the speed of your network is always going to be the bottleneck, so limiting the amount of data that you send over your network is going to greatly improve performance.
This is one of the reasons that so many databases use stored procedures that are run at the server end.
Improvements in parsing speed (if any) through the use of multithreading are going to be swamped by the comparative speed of your network transfer.
If you're committed to transferring your files before parsing them, an option that you could consider is the use of on-the-fly compression while doing your file transfer.
There are, for example, sftp servers available that will perform compression on the fly.
At the local end you could use something like libcurl to do the client side of the transfer, which also supports on-the-fly decompression.
The easiest way considering you are already copying the file would be to compress it before copying, and decompress once copying is complete. You will get huge gains compressing text files because zip algorithms generally work very well on them. Also your existing parsing logic could be kept intact rather than having to hook it up to a remote network text reader.
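A minimal compress-before-copy sketch with GZipStream (the paths and log content are placeholders):

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;

// Compress the log before transfer; decompress after copying completes.
string source = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
File.WriteAllText(source,
    string.Concat(Enumerable.Repeat("2024-01-01 INFO a fairly repetitive log line\n", 1000)));

string compressed = source + ".gz";
using (var input = File.OpenRead(source))
using (var gzip = new GZipStream(File.Create(compressed), CompressionLevel.Optimal))
{
    input.CopyTo(gzip);
}

// Text logs compress heavily, so the transfer payload shrinks accordingly.
Console.WriteLine(new FileInfo(compressed).Length < new FileInfo(source).Length); // True
```

At the receiving end, wrapping the file in a GZipStream with CompressionMode.Decompress restores the original text for the existing parser.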
The disadvantage of this method is that you won't be able to get line by line updates very efficiently, which are a good thing to have for a log parser.
I guess it depends on how "remote" it is. 100MB on a 100Mb LAN would be about 8 secs...up it to gigabit, and you'd have it in around 1 second. $50 * 2 for the cards, and $100 for a switch would be a very cheap upgrade you could do.
But, assuming it's further away than that, you should be able to open it with just read mode (as you're reading it when you're copying it). SMB/CIFS supports file block reading, so you should be streaming the file at that point (of course, you didn't actually say how you were accessing the file - I'm just assuming SMB).
Multithreading won't help, as you'll be disk or network bound anyway.
Use compression for transfer.
If your parsing is really slowing you down, and you have multiple processors, you can break the parsing job up, you just have to do it in a smart way -- have a deterministic algorithm for which workers are responsible for dealing with incomplete records. Assuming you can determine that a line is part of a middle of a record, for example, you could break the file into N/M segments, each responsible for M lines; when one of the jobs determines that its record is not finished, it just has to read on until it reaches the end of the record. When one of the jobs determines that it's reading a record for which it doesn't have a beginning, it should skip the record.
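A toy version of that boundary rule. It assumes, purely for illustration, that continuation lines of a multi-line record start with whitespace; real log formats need their own test:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Boundary rule sketch: each worker owns [start, end), skips a record whose
// beginning belongs to the previous worker, and reads past `end` to finish
// the record it started.
string[] lines =
{
    "entry A", "  detail 1", "entry B", "entry C", "  detail 2", "  detail 3",
};

bool IsContinuation(string line) => line.StartsWith(" ");

List<string> ParseSegment(int start, int end)
{
    var records = new List<string>();
    int i = start;
    while (i < end && IsContinuation(lines[i])) i++; // skip the partial record
    while (i < end)
    {
        string record = lines[i++];
        while (i < lines.Length && IsContinuation(lines[i])) // finish past `end`
            record += " | " + lines[i++].Trim();
        records.Add(record);
    }
    return records;
}

// Two workers split at line 3; every record is parsed exactly once.
var all = ParseSegment(0, 3).Concat(ParseSegment(3, lines.Length)).ToList();
Console.WriteLine(string.Join("; ", all));
// entry A | detail 1; entry B; entry C | detail 2 | detail 3
```

Because the skip rule and the read-past rule are mirror images, no record is parsed twice and none is dropped, wherever the split point lands.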
If you can copy the file, you can read it. So there's no need to copy it in the first place.
EDIT: use the FileStream class to have more control over the access and sharing modes.
new FileStream("logfile", FileMode.Open, FileAccess.Read, FileShare.ReadWrite)
should do the trick.
I've used SharpZipLib to compress large files before transferring them over the Internet. So that's one option.
Another idea for 1) would be to create an assembly that runs on the remote machine and does the parsing there. You could access the assembly from the local machine using .NET remoting. The remote assembly would need to be a Windows service or be hosted in IIS. That would allow you to keep your copies of the log files on the same machine, and in theory it would take less time to process them.
I think using compression (deflate/gzip) would help.
The given answers do not satisfy me, and maybe mine will help others see that this needn't be super complicated, and that multithreading can benefit such a scenario. It may not make the transfer faster, but depending on the complexity of your parsing it may make the parsing, or the analysis of the parsed data, faster.
It really depends upon the details of your parsing. What kind of information do you need to get from the log files? Is it information like statistics, or does it depend on multiple log messages?
You have several options:
parsing multiple files at the same time would be the easiest, I guess; you have the file as context and can create one thread per file
another option as mentioned before is use compression for the network communication
you could also use a helper that, as a first step, splits the log file into blocks of lines that belong together, and then processes these blocks with multiple threads; parsing these dependent lines should be quite easy and fast
Very important in such a scenario is to measure where your actual bottleneck is. If your bottleneck is the network, you won't benefit much from optimizing the parser. If your parser creates a lot of objects of the same kind, you could use the ObjectPool pattern and create objects with multiple threads. Try to process the input without allocating too many new strings. Parsers are often written using a lot of string.Split and so forth, which is not really as fast as it could be. You could navigate the stream by checking the upcoming values without reading the complete string and splitting it again, directly filling the objects you will need once parsing is done.
Optimization is almost always possible, the question is how much you get out for how much input and how critical your scenario is.
