Wondering if you could clarify.
I am writing a tool whose only job is to retrieve data from a database (SQL Server) and create txt files.
I am talking about 500,000 txt files.
It's working and all is good.
However, I was wondering whether using the Task Parallel Library could speed up the time it takes to create these files.
I know (have read) that the TPL is not meant to be used for I/O-bound processing and that it will most likely perform the same as sequential code.
Is this true?
Also, in an initial attempt using a simple Parallel.ForEach, I was getting an error that the file cannot be accessed because it is in use.
Any advice?
You do not parallelize I/O-bound processes.
The reason is simple: the CPU is not the bottleneck. No matter how many threads you start, you only have ONE disk to write to, and that is the slowest part.
So what you need to do is simply iterate over every file and write it. You can start a separate worker thread to do this work, or use async I/O to get better UI responsiveness.
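A minimal sketch of that approach for the original 500,000-file case, assuming a hypothetical GetRecords() method that streams rows from SQL Server (the output path and property names are illustrative):

    using System.IO;
    using System.Threading.Tasks;

    // Sequential, async writes: the disk sees one file at a time,
    // but the calling thread stays free.
    // GetRecords(), record.Id and record.Text are hypothetical placeholders
    // for whatever the SQL Server query returns.
    async Task WriteAllFilesAsync()
    {
        foreach (var record in GetRecords())
        {
            var path = Path.Combine(@"C:\output", record.Id + ".txt");
            await File.WriteAllTextAsync(path, record.Text);
        }
    }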
If you read and/or write from multiple disks, then parallelizing could improve speed. E.g. if you want to read all your files, run a hash on them, and store the hash, then you could create one thread per disk and you would see a significant speedup. However, in your case it seems that tasks are unlikely to improve performance.
Related
The Problem:
1. Download a batch of PDF files from pickup.fileserver (SFTP or Windows share) to the local hard drive (polling is involved here to check whether files are available to download).
2. Process the PDF files (resize, apply barcodes, etc.), create some metadata files, update the database, etc.
3. Upload this batch to dropoff.fileserver (SFTP).
4. Await the response from dropoff.fileserver (again, polling is the only option). Once the batch response is available, download it to the local HD.
5. Parse the batch response, update the database and finally upload a report to pickup.fileserver.
6. Archive all batch files to a SAN location and go back to step 1.
The Current Solution
We are expecting many such batches, so we have created a Windows service that keeps polling at certain time intervals and performs the steps mentioned above. It takes care of one batch at a time.
The Concern
The current solution works fine; however, I'm concerned that it is NOT making the best use of available resources, and there is certainly a lot of room for improvement. I have very little idea of how I can scale this Windows service so that it can process as many batches simultaneously as possible, and then, if required, how to involve multiple instances of this Windows service hosted on different servers to scale further.
I have read some MSDN articles and some SO answers on similar topics. There are suggestions about using producer-consumer patterns (BlockingCollection<T> etc.). Some say it wouldn't make sense to create a multi-threaded app for I/O-intensive tasks. What we have here is a mixture of disk-, network- and processor-intensive tasks. I need to understand how best to use threading or any other technology to make the best use of available resources on one server and to go beyond one server (if required) to scale further.
Typical Batch Size
We regularly get batches of ~200 files, ~300 MB total size. The number of batches can grow to about 50 to 100 in the next year or two. A couple of times a year, we get batches of 5k to 10k files.
As you say, what you have is a mixture of tasks, and it's probably going to be hard to implement a single pipeline that optimizes all your resources. I would look at breaking this down into six services (one per step) that can then be tuned, multiplied, or multi-threaded to provide the throughput you need.
Your sources are probably correct that you're not going to improve performance of your network tasks much by multithreading them. By breaking your application into several services, your resizing and barcoding service can start processing a file as soon as it's done downloading, while the download service moves on to downloading the next file.
The current solution works fine
Then keep it. That's my $0.02. Who cares if it's not terribly efficient? As long as it is efficient enough, then why change it?
That said...
I need to understand how best to use threading or any other technology to make best use of available resources on one server
If you want a new toy, I'd recommend using TPL Dataflow. It is designed specifically for wiring up pipelines that contain a mixture of I/O-bound and CPU-bound steps. Each step can be independently parallelized, and TPL Dataflow blocks understand asynchronous code, so they also work well with I/O.
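A rough sketch of such a pipeline, using the System.Threading.Tasks.Dataflow NuGet package; DownloadAsync, Resize, UploadAsync and batchFiles are hypothetical stand-ins for the real download, processing and upload steps:

    using System;
    using System.Threading.Tasks.Dataflow;

    // I/O-bound step: async download with limited concurrency.
    var download = new TransformBlock<string, byte[]>(
        async path => await DownloadAsync(path),
        new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

    // CPU-bound step: resize/barcode, parallelized across cores.
    var process = new TransformBlock<byte[], byte[]>(
        bytes => Resize(bytes),
        new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = Environment.ProcessorCount });

    // I/O-bound step again: async upload.
    var upload = new ActionBlock<byte[]>(
        async bytes => await UploadAsync(bytes),
        new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

    var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
    download.LinkTo(process, linkOptions);
    process.LinkTo(upload, linkOptions);

    foreach (var file in batchFiles)        // batchFiles: remote paths for one batch
        download.Post(file);
    download.Complete();
    await upload.Completion;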
and go beyond one server (if required) to scale further.
That's a totally different question. You'd need to use reliable queues and break the different steps into different processes, which can then run anywhere. This is a good place to start.
According to this article, you can implement background worker jobs (preferably with Hangfire) in your application layer, reduce the code and deployment management of multiple Windows services, and achieve the same result.
Also, you won't need to bother with handling multiple Windows services.
Additionally, it can recover from application-level failures or restart events.
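A rough sketch of what that could look like with Hangfire; BatchJobs.PollPickupServer and BatchJobs.ProcessBatch are hypothetical methods standing in for the polling and batch-processing steps:

    using Hangfire;

    // At application startup: point Hangfire at its storage and start a server.
    GlobalConfiguration.Configuration.UseSqlServerStorage("connection-string-here");
    using var server = new BackgroundJobServer();

    // Replace the Windows-service polling loop with a recurring job
    // (cron expression: every 5 minutes).
    RecurringJob.AddOrUpdate("poll-pickup", () => BatchJobs.PollPickupServer(), "*/5 * * * *");

    // When the poll finds a batch, enqueue its processing as a fire-and-forget job;
    // Hangfire persists it and retries it after a failure or restart.
    BackgroundJob.Enqueue(() => BatchJobs.ProcessBatch("batch-id-here"));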
There is no magic technology that will solve your problem; you need to analyse each part of it step by step.
You will need to profile the application and determine what areas are slow performing and refactor the code to resolve the problem.
This might mean increasing the demand on one resource to decrease the demand on another. For example, you might find that you are doing a database lookup 10 times for each file you process, and that caching the data before you start processing files is quicker, but perhaps only when the batch is larger than xx files.
You might also find that what speeds up the whole batch is not the optimal method for a single file.
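As a rough illustration of that trade-off, with hypothetical LoadAllLookups(), ProcessFile() and batchFiles standing in for the real database query and per-file work:

    using System.Linq;

    // One query for the whole batch instead of 10 lookups per file.
    // LoadAllLookups(), ProcessFile(), .Key and batchFiles are hypothetical placeholders.
    var cache = LoadAllLookups().ToDictionary(lookup => lookup.Key);

    foreach (var file in batchFiles)
    {
        // Memory use goes up, but the per-file database round-trips disappear;
        // whether that pays off depends on the batch size.
        ProcessFile(file, cache[file.Key]);
    }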
As your program has multiple steps then you can look at each of these in turn, and as a whole.
My guess would be that the FTP download and upload take the most time, so you could look at running these in parallel. Whether this means running xx threads at once, each processing a file, or having a separate task/thread for each stage in your process, you can only determine by testing.
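One way to try the "xx threads at once" variant with a measurable knob is to bound the concurrency explicitly; DownloadFileAsync is a hypothetical wrapper around whatever SFTP client is in use:

    using System.Collections.Generic;
    using System.Linq;
    using System.Threading;
    using System.Threading.Tasks;

    // Download up to 4 files at a time; vary the limit while measuring.
    async Task DownloadBatchAsync(IEnumerable<string> remotePaths)
    {
        using var throttle = new SemaphoreSlim(4);
        var tasks = remotePaths.Select(async path =>
        {
            await throttle.WaitAsync();
            try
            {
                await DownloadFileAsync(path);   // hypothetical SFTP download
            }
            finally
            {
                throttle.Release();
            }
        }).ToList();
        await Task.WhenAll(tasks);
    }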
A good design is critical for performance. But there are limits and sometimes it just takes time to do some tasks.
Don't forget that you must weigh this up against the time and effort needed to implement it and the benefit gained. If the service runs overnight and takes 6 hours, is it really a benefit if it takes 4 hours instead, when the people who need to work on the result won't be in the office until much later anyway?
For this kind of problem, do you have any specific file types that you download from the SFTP server? I have a similar problem downloading large files, but in my case it is not a Windows service; it is an EXE that runs on System.Timers.
Try to create threads for the file types which are large in size, e.g. PDFs.
You can check for these file types while downloading from the SFTP file path and assign them to a thread process to download.
You also need to do the same for uploads, in the other direction.
In my case, all I was able to do was tweak the existing process and create a separate thread for the large file types. That solved my problem, as flat files and large PDF files are downloaded on parallel threads.
I have an application in C# which involves a lot of file operations, i.e. reading, moving, deleting, appending, etc. For example, a file is read from a source path on the local file system and, after processing, it is deleted from there and the processed file is written to a target location on the local FS. This is all done in parallel on a group of systems, with each system working only on its local files. (Files were distributed among them by the load balancer.)
How can I possibly improve the performance of this application?
Things that I can think of are:
1.) Create a queue for a particular type of operation, such as delete. Put the required info in the queue and have a separate thread process the queue.
2.) Instead of working on the FS, use an in-memory data store such as Redis. As the data will be in the cache, operations will be faster.
3.) Increase the parallelism of the code. Each thread will work on a separate file, which should be faster.
Will the above approaches work? Please suggest any other alternatives that might be worth giving a thought.
1.) I would suggest batching together operations that share a common context, to reduce synchronization/context-switching overhead and take advantage of your processor's caching mechanism.
2.) Grouping files together into a single file will reduce Windows's per-file handshake performance penalty.
3.) Try using pointers and/or the Win32 API directly; in many cases they turn out to be faster than their managed wrapper/library implementations.
4.) A blocking collection queue (producer/consumer) can be a good starting point.
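A minimal sketch of point 4, combined with the delete-queue idea from the question above (the path is illustrative):

    using System.Collections.Concurrent;
    using System.IO;
    using System.Threading.Tasks;

    // Producer/consumer: processing threads enqueue delete requests,
    // one consumer drains the queue.
    var deletes = new BlockingCollection<string>(boundedCapacity: 1000);

    var consumer = Task.Run(() =>
    {
        // Blocks until items arrive; exits once CompleteAdding() has been
        // called and the queue is empty.
        foreach (var path in deletes.GetConsumingEnumerable())
            File.Delete(path);
    });

    // Producers just add paths as they finish processing each file:
    deletes.Add(@"C:\work\done\file1.txt");   // illustrative path
    // ... when no more work will be produced:
    deletes.CompleteAdding();
    await consumer;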
I have a Parallel.ForEach loop creating BinaryReaders on the same group of large data files.
I was just wondering if it hurts performance that these readers are reading the same files in a parallel fashion (i.e. if they were reading exclusively different files, would it go faster?).
I am asking because there is a lot of disk I/O access involved (I guess...).
Edit: I forgot to mention that I am using an Amazon EC2 instance and the data is on the C:\ disk assigned to it. I have no idea how that affects this issue.
Edit 2: I'll take measurements by duplicating the data folder, reading from the two different sources, and seeing what that gives.
It's not a good idea to read from the same disk using multiple threads. Since the disk's mechanical head needs to move every time to seek the next reading location, you are basically bouncing it around with multiple threads, thus hurting performance.
The best approach is actually to read the files sequentially using a single thread and then handing the chunks to a group of threads to process them in parallel.
It depends on where your files are. If you're using one mechanical hard-disk, then no - don't read files in parallel, it's going to hurt performance. You may have other configurations, though:
On a single SSD, reading files in parallel will probably not hurt performance, but I don't expect you'll gain anything either.
On two mirrored disks using RAID 1 and a half-decent RAID controller, you can read two files at once and gain considerable performance.
If your files are stored on a SAN, you can most definitely read a few at a time and improve performance.
You'll have to try it, but you have to be careful with this - if the files aren't large enough, the OS caching mechanisms are going to affect your measurements, and the second test run is going to be really fast.
I wrote a script to read a 100 MB+ text file using a single thread and using multiple threads. The multi-threaded script shares the same StreamReader and locks it during the StreamReader.ReadLine() call. After timing my two scripts, they are about the same speed (it seems that ReadLine() is what's taking up most of the run time).
Where can I take this next? I'm thinking of splitting the source file into multiple text files so each thread can work with its own StreamReader, but that seems a bit cumbersome. Is there a better way to speed up my process?
Thanks!
With a single hard disk, there's not much you can do except use a single-producer (to read files), multiple-consumer (for processing) model. A hard disk needs to move its mechanical head in order to seek the next reading position; multiple threads doing this will just bounce the head around and not bring any speedup (worse, in some cases it may be slower).
Splitting the input file is even worse, because now the file chunks are no longer consecutive and need further seeking.
So use a single thread to read chunks of the large file and either put the work items in a synchronized queue (e.g. ConcurrentQueue) for multiple consumer threads, or use QueueUserWorkItem to access the built-in thread pool.
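A rough sketch of that single-reader, multiple-consumer setup; BlockingCollection wraps a ConcurrentQueue and adds blocking and completion semantics, and ProcessLine stands in for whatever per-line work the script does:

    using System;
    using System.Collections.Concurrent;
    using System.IO;
    using System.Linq;
    using System.Threading.Tasks;

    var lines = new BlockingCollection<string>(boundedCapacity: 10000);

    // Single producer: only this thread touches the disk.
    var reader = Task.Run(() =>
    {
        foreach (var line in File.ReadLines(@"C:\data\big.txt"))   // illustrative path
            lines.Add(line);
        lines.CompleteAdding();
    });

    // Multiple consumers: CPU-bound processing in parallel.
    var consumers = Enumerable.Range(0, Environment.ProcessorCount)
        .Select(_ => Task.Run(() =>
        {
            foreach (var line in lines.GetConsumingEnumerable())
                ProcessLine(line);                                  // hypothetical per-line work
        }))
        .ToArray();

    await reader;
    await Task.WhenAll(consumers);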
Where can you take this next?
Add multiple HDDs, then have one thread per HDD. Split your file across the HDDs, kind of like RAID.
EDIT:
Similar questions have been asked many times here. Just use one thread to read the file and one thread to process it. No multithreading needed.
I am trying to write to different pieces of a large file using multiple threads, just like a segmented file downloader would do.
My question is, what is the safe way to do this? Do I open the file for writing, create my threads, and pass the Stream object to each thread? I don't want an error to occur because multiple threads are accessing the same object at potentially the same time.
This is C# by the way.
I would personally suggest that you fetch the data in multiple threads, but actually write to it from a single thread. It's likely to be considerably simpler that way. You could use a producer/consumer queue (which is really easy in .NET 4) and then each producer would feed pairs of "index, data". The consumer thread could then just sequentially seek, write, seek, write etc.
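A minimal sketch of that arrangement, with BlockingCollection as the producer/consumer queue; the target path and the firstSegment variable are illustrative:

    using System.Collections.Concurrent;
    using System.IO;
    using System.Threading.Tasks;

    // Each item is an (offset, data) pair produced by a download thread.
    var chunks = new BlockingCollection<(long Offset, byte[] Data)>();

    // A single consumer owns the FileStream, so no locking is needed on it.
    var writer = Task.Run(() =>
    {
        using var output = new FileStream(@"C:\downloads\target.bin",   // illustrative path
            FileMode.Create, FileAccess.Write, FileShare.None);
        foreach (var chunk in chunks.GetConsumingEnumerable())
        {
            output.Seek(chunk.Offset, SeekOrigin.Begin);
            output.Write(chunk.Data, 0, chunk.Data.Length);
        }
    });

    // Producers (the download threads) just add chunks as segments arrive:
    chunks.Add((0L, firstSegment));          // firstSegment: bytes fetched by a producer
    // ... once every segment has been queued:
    chunks.CompleteAdding();
    await writer;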
If this were Linux programming, I would recommend you look into the pwrite() command, which writes a buffer to a file at a given offset. A cursory search of C# documentation doesn't turn up anything like this however. Does anyone know if a similar function exists?
Although one might be able to open multiple streams pointing to the same file, and use a different stream in each thread, I would second the advice of using a single thread for the writing absent some reason to do otherwise. Even if two or more threads can safely write to the same file simultaneously, that doesn't mean it's a good idea. It may be helpful to have the unified thread attempt to sequence writes in a sensible order to avoid lots of random seeking; the performance benefit from that would depend upon how effectively the OS could cache and schedule random writes. Don't go crazy optimizing such things if it turns out the OS does a good job, but be prepared to add some optimization if the OS default behavior turns out to perform poorly.