Optimizing File Operations - C#

I have a C# application that involves a lot of file operations: reading, moving, deleting, appending, etc. For example, a file is read from a source path on the local file system, and after processing it is deleted from there and the processed file is written to a target location on the local file system. This all runs in parallel on a group of systems, each working only on its own local files (the files were distributed among them by the load balancer).
How can I possibly improve the performance of this application?
Things that I can think of are:
1.) Create a queue per type of operation, such as delete. Put the required info in the queue and have a separate thread process it.
2.) Instead of working on the file system, use an in-memory data store such as Redis. As the data will be in memory, operations should be faster.
3.) Increase the parallelism of the code. Each thread will work on a separate file, which should be faster.
Will the above approaches work? Please suggest any other alternatives that might be worth considering.

1.) I would suggest batching operations that share a context to reduce synchronization/context-switching overhead and to take advantage of your processor's caches.
2.) Grouping many small files into a single file will reduce Windows's per-file open/close overhead.
3.) Try using pointers and/or the Win32 API directly, which in many cases can be faster than their managed wrappers/library implementations.
4.) Blocking-collection queues (producer/consumer) can be a good starting point; see the sketch below.
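For point 4, a minimal sketch (the directory path is a placeholder, and this assumes .NET 4.5+): worker threads enqueue paths into a bounded BlockingCollection while a single background consumer performs the actual deletes, so the processing threads never block on the disk.

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

class DeleteQueue
{
    // Bounded queue so the delete backlog cannot grow without limit if the disk falls behind.
    static readonly BlockingCollection<string> _pendingDeletes =
        new BlockingCollection<string>(boundedCapacity: 1000);

    static void Main()
    {
        // Consumer: drains the queue until CompleteAdding() is called.
        var consumer = Task.Run(() =>
        {
            foreach (var path in _pendingDeletes.GetConsumingEnumerable())
            {
                try { File.Delete(path); }
                catch (IOException ex) { Console.Error.WriteLine($"Delete failed: {path}: {ex.Message}"); }
            }
        });

        // Producers: worker threads just enqueue and move on. Placeholder path.
        Parallel.ForEach(Directory.EnumerateFiles(@"C:\work\processed"), file =>
        {
            // ... process the file ...
            _pendingDeletes.Add(file);          // blocks only if the queue is full
        });

        _pendingDeletes.CompleteAdding();       // signal that no more items are coming
        consumer.Wait();
    }
}

The bounded capacity provides back-pressure: producers slow down instead of flooding the queue when the disk cannot keep up.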

Related

How to use multithreading or any other .NET technology to scale a program performing network, disk and processor intensive jobs?

The Problem:
1) Download a batch of PDF files from pickup.fileserver (SFTP or Windows share) to the local hard drive (polling is involved here to check whether files are available to download)
2) Process the PDF files (resize, apply barcodes etc.), create some metadata files, update the database etc.
3) Upload this batch to dropoff.fileserver (SFTP)
4) Await the response from dropoff.fileserver (again, polling is the only option). Once the batch response is available, download it to the local HD.
5) Parse the batch response, update the database and finally upload a report to pickup.fileserver
6) Archive all batch files to a SAN location and go back to step 1.
The Current Solution
We are expecting many such batches, so we have created a Windows service that keeps polling at certain time intervals and performs the steps mentioned above. It takes care of one batch at a time.
The Concern
The current solution works fine; however, I'm concerned that it is NOT making the best use of available resources, and there is certainly a lot of room for improvement. I have very little idea about how I can scale this Windows service to process as many batches simultaneously as it can, and then, if required, how to involve multiple instances of this Windows service hosted on different servers to scale further.
I have read some MSDN articles and some SO answers on similar topics. There are suggestions about using producer-consumer patterns (BlockingCollection<T> etc.). Some say that it wouldn't make sense to create a multi-threaded app for I/O-intensive tasks. What we have here is a mixture of disk, network and processor-intensive tasks. I need to understand how best to use threading or any other technology to make the best use of available resources on one server, and to go beyond one server (if required) to scale further.
Typical Batch Size
We regularly get batches of ~200 files, ~300 MB total size. The number of batches may grow to about 50 to 100 in the next year or two. A couple of times a year, we get batches of 5k to 10k files.
As you say, what you have is a mixture of tasks, and it's probably going to be hard to implement a single pipeline that optimizes all your resources. I would look at breaking this down into 6 services (one per step) that can then be tuned, multiplied or multi-threaded to provide the throughput you need.
Your sources are probably correct that you're not going to improve performance of your network tasks much by multithreading them. By breaking your application into several services, your resizing and barcoding service can start processing a file as soon as it's done downloading, while the download service moves on to downloading the next file.
The current solution works fine
Then keep it. That's my $0.02. Who cares if it's not terribly efficient? As long as it is efficient enough, then why change it?
That said...
I need to understand how best to use threading or any other technology to make best use of available resources on one server
If you want a new toy, I'd recommend using TPL Dataflow. It is designed specifically for wiring up pipelines that contain a mixture of I/O-bound and CPU-bound steps. Each step can be independently parallelized, and TPL Dataflow blocks understand asynchronous code, so they also work well with I/O.
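Purely as an illustration (the block names, degrees of parallelism and the DownloadAsync/Resize/UploadAsync stubs below are placeholders, not your actual steps), a three-stage download → process → upload pipeline in TPL Dataflow could be wired up like this:

using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;   // NuGet: System.Threading.Tasks.Dataflow

class PipelineSketch
{
    // Stand-ins for your real SFTP/PDF code.
    static Task<byte[]> DownloadAsync(string name) => Task.FromResult(new byte[0]);
    static byte[] Resize(byte[] pdf) => pdf;
    static Task UploadAsync(byte[] pdf) => Task.CompletedTask;

    static async Task Main()
    {
        var download = new TransformBlock<string, byte[]>(
            name => DownloadAsync(name),
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });   // I/O-bound

        var process = new TransformBlock<byte[], byte[]>(
            pdf => Resize(pdf),
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = Environment.ProcessorCount }); // CPU-bound

        var upload = new ActionBlock<byte[]>(
            pdf => UploadAsync(pdf),
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });   // I/O-bound

        var link = new DataflowLinkOptions { PropagateCompletion = true };
        download.LinkTo(process, link);
        process.LinkTo(upload, link);

        foreach (var file in new[] { "a.pdf", "b.pdf" })   // placeholder file names
            download.Post(file);

        download.Complete();
        await upload.Completion;
    }
}

Each block has its own MaxDegreeOfParallelism, which is how the I/O-bound and CPU-bound steps end up being tuned independently.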
and go beyond one server (if required) to scale further.
That's a totally different question. You'd need to use reliable queues and break the different steps into different processes, which can then run anywhere. This is a good place to start.
According to this article, you could implement background worker jobs (preferably with Hangfire) in your application layer and possibly achieve the same result while reducing the code and deployment management overhead of multiple Windows services.
You also won't need to bother with handling multiple Windows services.
Additionally, it can recover at the application level from failures or restart events.
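For what it's worth, a minimal Hangfire sketch along those lines might look like the following; the SQL Server connection string and the job bodies are placeholders, and this assumes the Hangfire and Hangfire.SqlServer packages are installed.

using System;
using Hangfire;

class Program
{
    static void Main()
    {
        // Placeholder connection string; Hangfire persists its jobs here.
        GlobalConfiguration.Configuration.UseSqlServerStorage(
            "Server=.;Database=HangfireDb;Integrated Security=true");

        using (var server = new BackgroundJobServer())
        {
            // Fire-and-forget job, stored durably and retried automatically on failure.
            BackgroundJob.Enqueue(() => Console.WriteLine("Process one batch"));

            // Recurring job that replaces the Windows-service polling timer.
            RecurringJob.AddOrUpdate("poll-pickup",
                () => Console.WriteLine("Poll pickup.fileserver"),
                Cron.Minutely());

            Console.ReadKey();
        }
    }
}

Because the jobs live in the database rather than in process memory, they survive application restarts, which covers the recovery point above.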
There is no magic technology that will solve your problem; you need to analyse each part of it step by step.
You will need to profile the application, determine which areas perform slowly, and refactor the code to resolve the problem.
This might mean increasing the demand on one resource to decrease the demand on another. For example: you might find that you are doing a database lookup 10 times for each file you process, and that caching the data before you start processing files is quicker, but perhaps only when a batch is larger than xx files.
You might also find that what speeds up the whole batch is not the optimal method for a single file.
As your program has multiple steps, you can look at each of these in turn, and at the process as a whole.
My guess would be that the FTP download and upload take the most time, so you can look at running these in parallel. Whether that means running xx threads at once, each processing a file, or having a separate task/thread for each stage in your process, you can only determine with testing; a throttled sketch follows below.
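A rough sketch of the parallel-transfer idea, where DownloadFileAsync is a stand-in for whatever SFTP/share client you use and 4 is an arbitrary concurrency limit:

using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class ParallelDownloads
{
    // Placeholder for the real transfer code.
    static Task DownloadFileAsync(string remotePath) => Task.Delay(100);

    static async Task DownloadBatchAsync(string[] remoteFiles, int maxConcurrency = 4)
    {
        using var gate = new SemaphoreSlim(maxConcurrency);   // throttle concurrent transfers
        var tasks = remoteFiles.Select(async file =>
        {
            await gate.WaitAsync();
            try { await DownloadFileAsync(file); }
            finally { gate.Release(); }
        });
        await Task.WhenAll(tasks);
    }
}

Testing with different limits, as noted above, is the only way to find the sweet spot for your particular link and server.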
A good design is critical for performance. But there are limits and sometimes it just takes time to do some tasks.
Don't forget that you must weigh this up against the time and effort needed to implement it and the benefit gained. If the service runs overnight and takes 6 hours, is cutting it to 4 hours really a benefit when the people who need to work on the results won't be in the office until much later anyway?
For this kind of problem: do you have any specific file types that you download from the SFTP? I had a similar problem downloading large files, although in my case it was not a Windows service but an EXE driven by System.Timers.
Try to create threads for the file types that are large in size, e.g. PDFs. You can check for these file types while downloading from the SFTP path and assign them to a separate thread to download.
You need to do the same for uploads, in reverse.
In my case, all I was able to do was tweak the existing code and create a separate thread for the large file types. That solved my problem, as flat files and large PDF files are downloaded on parallel threads.

Use the Task Parallel Library for I/O-bound processing

Wondering if you could clarify.
I am writing a tool; all it has to do is retrieve data from a database (SQL Server) and create txt files.
I am talking about 500,000 txt files.
It's working and all is good.
However, I was wondering if using the Task Parallel Library could speed up the time it takes to create these files.
I know (have read) that the TPL is not meant to be used for I/O-bound processing and that it will most likely perform the same as sequential code.
Is this true?
Also, in an initial attempt using a simple Parallel.ForEach, I was getting an error: cannot access the file because it is in use.
Any advice?
You do not parallelize I/O-bound processes.
The reason is simple: the CPU is not the bottleneck. No matter how many threads you start, you only have ONE disk to write to, and that is the slowest part.
So what you need to do is simply iterate over the files and write them. You can start a separate worker thread to do this, or use async I/O to get a better UI response; a sketch follows.
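A small sketch of the second option, assuming .NET Core / .NET 5+ where File.WriteAllTextAsync is available; the (Path, Content) tuple is a placeholder for whatever your database rows look like:

using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Threading.Tasks;

static class Exporter
{
    // Writes the files one after another; the await keeps the calling thread free
    // (e.g. for UI updates) while the disk does the work.
    public static async Task ExportAsync(IEnumerable<(string Path, string Content)> rows)
    {
        foreach (var row in rows)
            await File.WriteAllTextAsync(row.Path, row.Content, Encoding.UTF8);
    }
}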
If you read and/or write from multiple disks, then parallelizing could improve speed. E.g. if you want to read all your files, run a hash on them, and store the hash, you could create one thread per disk and you would see a significant speed-up. However, in your case it seems like tasks are unlikely to improve performance.

Parallel Concurrent Binary Readers

I have a Parallel.ForEach loop creating BinaryReaders on the same group of large data files.
I was just wondering whether it hurts performance that these readers are reading the same files in parallel (i.e., if they were each reading exclusively different files, would it go faster?).
I am asking because there is a lot of disk I/O involved (I guess...).
Edit: I forgot to mention that I am using an Amazon EC2 instance and the data is on the C:\ disk assigned to it. I have no idea how that affects this issue.
Edit 2: I'll take some measurements, duplicating the data folder and reading from 2 different sources, and see what that gives.
It's not a good idea to read from the same disk using multiple threads. Since the disk's mechanical head needs to move to seek the next read location, reading with multiple threads just bounces it back and forth, hurting performance.
The best approach is actually to read the files sequentially using a single thread and then hand the chunks to a group of threads to process in parallel, as sketched below.
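A sketch of that shape using System.Threading.Channels (the directory path, chunk size and Process body are placeholders): one task reads sequentially while several consumers do the CPU work.

using System;
using System.IO;
using System.Linq;
using System.Threading.Channels;
using System.Threading.Tasks;

class SequentialReadParallelProcess
{
    static async Task Main()
    {
        // Bounded channel: the reader cannot race too far ahead of the consumers.
        var channel = Channel.CreateBounded<byte[]>(16);

        var reader = Task.Run(async () =>
        {
            foreach (var file in Directory.EnumerateFiles(@"D:\data"))   // placeholder path
            {
                using (var fs = File.OpenRead(file))
                {
                    var buffer = new byte[1 << 20];                      // 1 MB chunks
                    int read;
                    while ((read = await fs.ReadAsync(buffer, 0, buffer.Length)) > 0)
                    {
                        var chunk = new byte[read];
                        Array.Copy(buffer, chunk, read);
                        await channel.Writer.WriteAsync(chunk);
                    }
                }
            }
            channel.Writer.Complete();
        });

        var consumers = Enumerable.Range(0, Environment.ProcessorCount).Select(_ => Task.Run(async () =>
        {
            await foreach (var chunk in channel.Reader.ReadAllAsync())
                Process(chunk);                                          // CPU-bound work
        })).ToArray();

        await Task.WhenAll(consumers.Append(reader));
    }

    static void Process(byte[] chunk) { /* hash, parse, etc. */ }
}

Whether this actually beats the plain Parallel.ForEach is exactly the kind of thing the measurements from Edit 2 will show.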
It depends on where your files are. If you're using one mechanical hard-disk, then no - don't read files in parallel, it's going to hurt performance. You may have other configurations, though:
On a single SSD, reading files in parallel will probably not hurt performance, but I don't expect you'll gain anything.
On two mirrored disks using RAID 1 and a half-decent RAID controller, you can read two files at once and gain considerable performance.
If your files are stored on a SAN, you can most definitely read a few at a time and improve performance.
You'll have to try it, but you have to be careful with this - if the files aren't large enough, the OS caching mechanisms are going to affect your measurements, and the second test run is going to be really fast.

How to improve read and write speed or performance for a large number of smaller files

Yesterday I asked a question here: how to disable the disk cache in C# by invoking the Win32 CreateFile API with FILE_FLAG_NO_BUFFERING.
My performance test (a write and read test, 1000 files, 220 MB total size) showed that FILE_FLAG_NO_BUFFERING did not improve performance and was slower than the .NET default disk cache; when I changed FILE_FLAG_NO_BUFFERING to FILE_FLAG_SEQUENTIAL_SCAN, I could match the .NET default disk cache and was a little faster.
Before that, I tried using MongoDB's GridFS feature to replace the Windows file system; it was not good (and I don't need the distributed feature, it was just a taste).
In my product, the server receives a lot of smaller files (60-100 KB) per second over TCP/IP, then needs to save them to disk, and a third service reads these files once (just read once and process). Would asynchronous I/O help me get the best speed with the lowest CPU usage? Can someone give me a suggestion? Or can I still use the FileStream class?
Update 1
Could a memory-mapped file achieve what I need, i.e. writing all the files into one big file (or a few) and reading from it?
If your PC is taking 5-10 seconds to write a 100kB file to disk, then you either have the world's oldest, slowest PC, or your code is doing something very inefficient.
Turning off disk caching will probably make things worse rather than better. With a disk cache in place, your writes will be fast, and Windows will do the slow part of flushing the data to disk later. Indeed, increasing I/O buffering usually results in significantly improved I/O in general.
You definitely want to use asynchronous writes - that means your server starts the data writing, and then goes back to responding to its clients while the OS deals with writing the data to disk in the background.
There shouldn't be any need to queue the writes (as the OS will already be doing that if disk caching is enabled), but that is something you could try if all else fails: it could potentially help by writing only one file at a time to minimise the need for disk seeks.
Generally for I/O, using larger buffers helps to increase your throughput. For example instead of writing each individual byte to the file in a loop, write a buffer-ful of data (ideally the entire file, for the sizes you mentioned) in one Write operation. This will minimise the overhead (instead of calling a write function for every byte, you call a function once for the entire file). I suspect you may be doing something like this, as it's the only way I know to reduce performance to the levels you've suggested you are getting.
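A sketch combining the two suggestions above (asynchronous write, one Write call per file); the 64 KB buffer size and the path are arbitrary placeholders:

using System.IO;
using System.Threading.Tasks;

static class FileSaver
{
    // One buffered, asynchronous write per incoming 60-100 KB file.
    public static async Task SaveAsync(string path, byte[] payload)
    {
        using (var fs = new FileStream(path, FileMode.Create, FileAccess.Write,
                                       FileShare.None, bufferSize: 64 * 1024, useAsync: true))
        {
            await fs.WriteAsync(payload, 0, payload.Length);   // single call for the whole file
        }
    }
}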
Memory-mapped files will not help you. They're really best for accessing the contents of huge files.
One of the biggest and most significant improvements in your case, in my opinion, would be to process the files without saving them to disk first and then, if you really need to store them, push them onto a queue and have another thread save them to disk. That way you immediately get the processed data you need, without losing time saving the data to disk, yet you still end up with the file on disk afterwards, without taking computational power away from your file processor.

What's the best way to read and parse a large text file over the network?

I have a problem which requires me to parse several log files from a remote machine.
There are a few complications:
1) The file may be in use
2) The files can be quite large (100 MB+)
3) Each entry may be multi-line
To solve the in-use issue, I need to copy the file first. I'm currently copying it directly from the remote machine to the local machine and parsing it there. That leads to issue 2: since the files are quite large, copying them locally can take quite a while.
To improve parsing time, I'd like to make the parser multi-threaded, but that makes dealing with multi-line entries a bit trickier.
The two main issues are:
1) How do I speed up the file transfer? (Compression? Is transferring it locally even necessary? Can I read an in-use file some other way?)
2) How do I deal with multi-line entries when splitting the lines up among threads?
UPDATE: The reason I didn't do the obvious thing and parse on the server is that I want to have as little CPU impact as possible. I don't want to affect the performance of the system I'm testing.
If you are reading a sequential file, you want to read it line by line over the network. You need a transfer method capable of streaming; you'll need to review your I/O streaming technology to figure this out.
Large I/O operations like this won't benefit much from multithreading, since you can probably process the items as fast as you can read them over the network.
Your other great option is to put the log parser on the server, and download the results.
The better option, from the perspective of performance, is going to be to perform your parsing at the remote server. Apart from exceptional circumstances the speed of your network is always going to be the bottleneck, so limiting the amount of data that you send over your network is going to greatly improve performance.
This is one of the reasons that so many databases use stored procedures that are run at the server end.
Improvements in parsing speed (if any) through the use of multithreading are going to be swamped by the comparative speed of your network transfer.
If you're committed to transferring your files before parsing them, an option that you could consider is the use of on-the-fly compression while doing your file transfer.
There are, for example, sftp servers available that will perform compression on the fly.
At the local end you could use something like libcurl to do the client side of the transfer, which also supports on-the-fly decompression.
The easiest way considering you are already copying the file would be to compress it before copying, and decompress once copying is complete. You will get huge gains compressing text files because zip algorithms generally work very well on them. Also your existing parsing logic could be kept intact rather than having to hook it up to a remote network text reader.
The disadvantage of this method is that you won't be able to get line by line updates very efficiently, which are a good thing to have for a log parser.
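A sketch of that approach with the built-in GZipStream (the paths are placeholders): compress at the source, copy the small .gz file, then decompress locally before running the existing parser.

using System.IO;
using System.IO.Compression;

static class LogTransfer
{
    // Compress the log into a .gz file before copying it over the network.
    public static void CompressFile(string sourcePath, string gzPath)
    {
        using (var source = File.OpenRead(sourcePath))
        using (var target = File.Create(gzPath))
        using (var gzip = new GZipStream(target, CompressionLevel.Optimal))
            source.CopyTo(gzip);
    }

    // Decompress the copy locally so the existing parsing logic stays intact.
    public static void DecompressFile(string gzPath, string targetPath)
    {
        using (var source = File.OpenRead(gzPath))
        using (var gzip = new GZipStream(source, CompressionMode.Decompress))
        using (var target = File.Create(targetPath))
            gzip.CopyTo(target);
    }
}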
I guess it depends on how "remote" it is. 100 MB on a 100 Mb LAN would take about 8 seconds; up it to gigabit and you'd have it in around 1 second. $50 * 2 for the cards and $100 for a switch would be a very cheap upgrade you could do.
But, assuming it's further away than that, you should be able to open it with just read mode (as you're reading it when you're copying it). SMB/CIFS supports file block reading, so you should be streaming the file at that point (of course, you didn't actually say how you were accessing the file - I'm just assuming SMB).
Multithreading won't help, as you'll be disk or network bound anyway.
Use compression for transfer.
If your parsing is really slowing you down and you have multiple processors, you can break the parsing job up; you just have to do it in a smart way: have a deterministic rule for which workers are responsible for dealing with incomplete records. Assuming you can determine that a line is part of the middle of a record, for example, you could break the file into N/M segments, each responsible for M lines. When one of the jobs determines that its last record is not finished, it just has to read on until it reaches the end of that record. When one of the jobs determines that it's reading a record for which it doesn't have the beginning, it should skip that record. A sketch follows.
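A sketch of that splitting rule, assuming purely for illustration that a record starts on a line beginning with a digit (e.g. a timestamp) and that continuation lines do not; the file path is a placeholder.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

static class SegmentedParser
{
    // Assumption for the sketch: record-start lines begin with a digit (a timestamp).
    static bool IsRecordStart(string line) => line.Length > 0 && char.IsDigit(line[0]);

    // Each worker owns lines [start, start + count). It skips a leading record whose
    // beginning lies in the previous segment, and reads past its own end to finish
    // the last record it started.
    static List<string> ParseSegment(string[] lines, int start, int count)
    {
        var records = new List<string>();
        int i = start;
        if (start > 0)
            while (i < lines.Length && !IsRecordStart(lines[i])) i++;

        int end = start + count;
        while (i < lines.Length && i < end)
        {
            var record = new StringBuilder(lines[i++]);
            while (i < lines.Length && !IsRecordStart(lines[i]))
                record.Append('\n').Append(lines[i++]);
            records.Add(record.ToString());
        }
        return records;
    }

    static void Main()
    {
        var lines = File.ReadAllLines("copied.log");          // placeholder path
        int workers = Environment.ProcessorCount;
        int chunk = (lines.Length + workers - 1) / workers;

        var records = Enumerable.Range(0, workers)
            .AsParallel()
            .SelectMany(w => ParseSegment(lines, w * chunk, chunk))
            .ToList();

        Console.WriteLine($"{records.Count} records parsed");
    }
}

Because a record that spans a boundary is always finished by the worker that started it and skipped by the next one, no record is parsed twice or missed.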
If you can copy the file, you can read it. So there's no need to copy it in the first place.
EDIT: use the FileStream class to have more control over the access and sharing modes.
new FileStream("logfile", FileMode.Open, FileAccess.Read, FileShare.ReadWrite)
should do the trick.
I've used SharpZipLib to compress large files before transferring them over the Internet. So that's one option.
Another idea for 1) would be to create an assembly that runs on the remote machine and does the parsing there. You could access the assembly from the local machine using .NET remoting. The remote assembly would need to be a Windows service or be hosted in IIS. That would allow you to keep your copies of the log files on the same machine, and in theory it would take less time to process them.
I think using compression (deflate/gzip) would help.
The given answers do not satisfy me, and maybe my answer will help others not to think this is super complicated or that multithreading wouldn't help in such a scenario. Maybe it will not make the transfer faster, but depending on the complexity of your parsing it may make the parsing, or the analysis of the parsed data, faster.
It really depends on the details of your parsing. What kind of information do you need to get from the log files? Is it information like statistics, or does it depend on multiple log messages?
You have several options:
parsing multiple files at the same time would be the easiest, I guess; you have the file as context and can create one thread per file
another option, as mentioned before, is to use compression for the network communication
you could also use a helper that, as a first step, splits the log file into groups of lines that belong together and then processes these blocks of lines with multiple threads; the parsing of these dependent lines should be quite easy and fast
Very important in such a scenario is to measure where your actual bottleneck is. If your bottleneck is the network, you won't benefit much from optimizing the parser. If your parser creates a lot of objects of the same kind, you could use the object-pool pattern and create objects on multiple threads. Try to process the input without allocating too many new strings. Parsers are often written using a lot of string.Split and so forth, which is not as fast as it could be. You could navigate the stream by checking the incoming values without reading the complete string and splitting it again, and instead directly fill the objects you will need once parsing is done; a sketch of this is shown below.
Optimization is almost always possible; the question is how much you get out for how much effort, and how critical your scenario is.
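As a tiny illustration of the parsing point (the "timestamp|level|message" layout is an assumption of the example, not of the question, and the span overload of DateTime.Parse assumes .NET Core 2.1+), fields can be located with IndexOf and sliced lazily instead of calling string.Split on every line:

using System;

readonly struct LogEntry
{
    public readonly DateTime Timestamp;
    public readonly ReadOnlyMemory<char> Level;
    public readonly ReadOnlyMemory<char> Message;

    public LogEntry(DateTime timestamp, ReadOnlyMemory<char> level, ReadOnlyMemory<char> message)
    {
        Timestamp = timestamp;
        Level = level;
        Message = message;
    }
}

static class LineParser
{
    // Parses e.g. "2024-01-01 12:00:00|INFO|something happened" without string.Split;
    // no Substring allocations, the memory slices reuse the original line.
    public static LogEntry Parse(string line)
    {
        int p1 = line.IndexOf('|');
        int p2 = line.IndexOf('|', p1 + 1);

        return new LogEntry(
            DateTime.Parse(line.AsSpan(0, p1)),
            line.AsMemory(p1 + 1, p2 - p1 - 1),
            line.AsMemory(p2 + 1));
    }
}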
