best way to handle multiple streams to same file - c#

I am trying to create a binary storage. Files, once added, can't be deleted. All files are stored in the same master file (storage.bin), and I will have an index table with (key, startFilePos, fileLength) entries.
I am struggling to find the best solution before I start coding. As I am not that familiar with streams under .NET, I would like some advice on them.
1. Are multiple streams to the same file (storage.bin) a solution? As files may be added in parallel, once I start receiving a file, I reserve the space on disk for it and start writing that specific file to the reserved block while other same-scenario files are added on other threads.
2. If (1) is a viable solution, is it the optimal one? If not, could you point to a better one?
3. What is the best way to allocate threads for each file-add scenario? It seems like 1 thread / 1 stream isn't a viable solution. What I was thinking is to have a thread pool and, if the limit is reached, have a new file wait for a thread to become available.

Are multiple streams to the same file (storage.bin) a solution? As files may be added in parallel, once I start receiving a file, I
reserve the space on disk for it and start writing that specific file
to the reserved block while other same-scenario files are added on
other threads.
Yes, this could be a solution. As long as the different streams write to different, non-overlapping segments of the file, there will be no problems.
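A minimal sketch of that approach, assuming the segment sizes are known up front so the full file can be pre-allocated with SetLength; the 8-byte file, offsets, and class names here are purely illustrative:

```csharp
using System.IO;
using System.Threading.Tasks;

class SegmentedWriter
{
    // Each task opens its own FileStream on storage.bin, seeks to its
    // pre-reserved offset, and writes only inside its own segment.
    static void WriteSegment(string path, long offset, byte[] data)
    {
        // FileShare.ReadWrite lets other streams keep the file open concurrently.
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Write, FileShare.ReadWrite))
        {
            fs.Position = offset;          // jump to this file's reserved block
            fs.Write(data, 0, data.Length);
        }
    }

    static void Main()
    {
        const string path = "storage.bin";
        // Pre-allocate the full size so every segment's offset already exists.
        using (var fs = new FileStream(path, FileMode.Create, FileAccess.Write))
            fs.SetLength(8);

        // Two "incoming files" written in parallel to disjoint segments.
        Task.WaitAll(
            Task.Run(() => WriteSegment(path, 0, new byte[] { 65, 65, 65, 65 })),   // "AAAA"
            Task.Run(() => WriteSegment(path, 4, new byte[] { 66, 66, 66, 66 })));  // "BBBB"
    }
}
```

Because the segments never overlap, no locking between the writers is needed.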
If (1) is a viable solution, is it the optimal one? If not, could you point to a better one?
Use separate files and keep the index in the master file.
What is the best way to allocate threads for each file-add scenario? It seems like 1 thread / 1 stream isn't a viable solution.
What I was thinking is to have a thread pool and, if the limit is
reached, have a new file wait for a thread to become available.
You've just described how a thread pool works. Before rolling your own, make sure you have checked the ThreadPool class built directly into the framework, or a custom concurrency-limiting task scheduler.
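The "wait for a slot" behaviour can also be sketched with SemaphoreSlim as the concurrency limiter; the limit of 3 and the AddFileAsync name are illustrative, not from the question:

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class Uploader
{
    // At most 3 file-adds run concurrently; the rest wait for a slot.
    static readonly SemaphoreSlim Slots = new SemaphoreSlim(3);
    static int _stored;

    static async Task AddFileAsync(int fileId)
    {
        await Slots.WaitAsync();          // blocks when 3 adds are in flight
        try
        {
            await Task.Delay(10);         // stand-in for the actual segment write
            Interlocked.Increment(ref _stored);
        }
        finally
        {
            Slots.Release();              // frees the slot for the next waiting file
        }
    }

    static async Task Main()
    {
        await Task.WhenAll(Enumerable.Range(0, 10).Select(AddFileAsync));
        Console.WriteLine(_stored);       // 10
    }
}
```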

Related

How to use multithreading or any other .NET technology to scale a program performing network, disk and processor intensive jobs?

The Problem:
1. Download a batch of PDF files from pickup.fileserver (SFTP or Windows share) to the local hard drive (polling is involved here to check whether files are available to download).
2. Process the PDF files (resize, apply barcodes, etc.), create some metadata files, update the database, and so on.
3. Upload the batch to dropoff.fileserver (SFTP).
4. Await the response from dropoff.fileserver (again, polling is the only option). Once the batch response is available, download it to the local HD.
5. Parse the batch response, update the database, and finally upload the report to pickup.fileserver.
6. Archive all batch files to a SAN location and go back to step 1.
The Current Solution
We are expecting many such batches, so we have created a Windows service which keeps polling at certain time intervals and performs the steps mentioned above. It takes care of one batch at a time.
The Concern
The current solution works fine; however, I'm concerned that it is NOT making the best use of available resources, and there is certainly a lot of room for improvement. I have very little idea how I can scale this Windows service to process as many batches simultaneously as it can, and then, if required, how to involve multiple instances of this Windows service hosted on different servers to scale further.
I have read some MSDN articles and some SO answers on similar topics. There are suggestions about using producer-consumer patterns (BlockingCollection<T>, etc.). Some say that it wouldn't make sense to create a multi-threaded app for I/O-intensive tasks. What we have here is a mixture of disk-, network- and processor-intensive tasks. I need to understand how best to use threading or any other technology to make the best use of available resources on one server, and how to go beyond one server (if required) to scale further.
Typical Batch Size
We regularly get batches of ~200 files, ~300 MB total size. The number of batches may grow to about 50 to 100 in the next year or two. A couple of times a year, we get batches of 5k to 10k files.
As you say, what you have is a mixture of tasks, and it's probably going to be hard to implement a single pipeline that optimizes all your resources. I would look at breaking this down into 6 services (one per step) that can then be tuned, multiplied or multi-threaded to provide the throughput you need.
Your sources are probably correct that you're not going to improve performance of your network tasks much by multithreading them. By breaking your application into several services, your resizing and barcoding service can start processing a file as soon as it's done downloading, while the download service moves on to downloading the next file.
The current solution works fine
Then keep it. That's my $0.02. Who cares if it's not terribly efficient? As long as it is efficient enough, then why change it?
That said...
I need to understand how best to use threading or any other technology to make best use of available resources on one server
If you want a new toy, I'd recommend using TPL Dataflow. It is designed specifically for wiring up pipelines that contain a mixture of I/O-bound and CPU-bound steps. Each step can be independently parallelized, and TPL Dataflow blocks understand asynchronous code, so they also work well with I/O.
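A minimal sketch of such a pipeline, assuming the System.Threading.Tasks.Dataflow NuGet package; the file names and delays are illustrative stand-ins for real network and CPU work:

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;   // NuGet: System.Threading.Tasks.Dataflow

class BatchPipeline
{
    static async Task Main()
    {
        // I/O-bound step: the Task.Delay stands in for a real download.
        var download = new TransformBlock<string, string>(async name =>
        {
            await Task.Delay(10);
            return name + " [downloaded]";
        }, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

        // CPU-bound step (resize/barcode), parallelized independently.
        var process = new TransformBlock<string, string>(
            name => name + " [processed]",
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 2 });

        var upload = new ActionBlock<string>(name => Console.WriteLine(name));

        var link = new DataflowLinkOptions { PropagateCompletion = true };
        download.LinkTo(process, link);
        process.LinkTo(upload, link);

        for (int i = 0; i < 3; i++) download.Post($"file{i}.pdf");
        download.Complete();
        await upload.Completion;   // prints each file with both suffixes
    }
}
```

Each block has its own degree of parallelism, so a file can be resized while the next one is still downloading.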
and go beyond one server (if required) to scale further.
That's a totally different question. You'd need to use reliable queues and break the different steps into different processes, which can then run anywhere. This is a good place to start.
According to this article, you could implement background worker jobs (preferably with Hangfire) in your application layer, reduce the code and deployment management of multiple Windows services, and possibly achieve the same result.
You also wouldn't need to bother with handling multiple Windows services.
Additionally, jobs can be restored in case of an application-level failure or a restart event.
There is no magic technology that will solve your problem, you need to analyse each part of it step by step.
You will need to profile the application and determine what areas are slow performing and refactor the code to resolve the problem.
This might mean increasing the demand on one resource to decrease the demand on another. For example, you might find that you are doing a database lookup 10 times for each file you process, but that caching the data before you start processing files is quicker, though perhaps only if you have a batch larger than xx files.
You might also find that what increases the processing speed of the whole batch is not the optimal method for a single file.
As your program has multiple steps then you can look at each of these in turn, and as a whole.
My guess would be that the FTP download and upload will take the most time, so you can look at running these in parallel. Whether this means running xx threads at once, each processing a file, or having a separate task/thread for each stage in your process, you can only determine by testing.
A good design is critical for performance. But there are limits and sometimes it just takes time to do some tasks.
Don't forget that you must weigh this up against the time and effort needed to implement it, and the benefit. If the service runs overnight and takes 6 hours, is it really a benefit if it takes 4 hours instead, when the people who need to work on the result won't be in the office until much later anyway?
For this kind of problem, do you have any specific file types that you download from the SFTP? I had a similar problem downloading large files, but in my case it was not a Windows service; it was an EXE running on System.Timers.
Try to create threads for the file types which are large in size, e.g. PDFs. You can check for these file types while downloading from the SFTP file path and assign them to a thread to download. You will also need to do the same for uploads, vice versa.
In my case, all I was able to do was tweak the existing solution and create a separate thread for the large file types. That solved my problem, as flat files and large PDF files are now downloaded on parallel threads.

Having multiple simultaneous writers (no reader) to a single file. Is it possible to accomplish in a performant way in .NET?

I'm developing a multi-segment file downloader. To accomplish this task I currently create as many temporary files on disk as I have segments (their number is fixed while the file is downloading). At the end I create a new file f and copy all the segments' contents into f.
I was wondering if there is a better way to accomplish this. My ideal would be to create f at its full size initially and then have the different threads write directly onto their own portion. There need not be any kind of interaction between them: we can assume each will start at its own starting point in the file and then only fill in information sequentially until its task is over.
I've heard about memory-mapped files (http://msdn.microsoft.com/en-us/library/dd997372(v=vs.110).aspx) and I'm wondering whether they are the solution to my problem.
Thanks
Using the memory-mapped API is absolutely doable, and it will probably perform quite well; of course, some testing is recommended.
If you want to look for a possible alternative implementation, I have the following suggestion.
Create a static stack data structure, where the download threads can push each file segment as soon as it's downloaded.
Have a separate thread listen for push notifications on the stack. Pop the stack file segments and save each segment into the target file in a single threaded way.
By following the above pattern, you have separated the download of file segments and the saving into a regular file, by putting a stack container in between.
Depending on the implementation of the stack handling, you will be able to implement this with very little thread locking, which will maximise performance.
The advantage of this is that you have 100% control over what is going on, and a solution that might be more portable (if that should ever be a concern).
The stack-decoupling pattern can also be implemented fairly generically and might even be reused in the future.
The implementation is not that complex, and probably on par with what would be needed around the memory-mapped API.
Have fun...
/Anders
The answers posted so far are, of course, addressing your question, but you should also consider the fact that multi-threaded I/O writes will most likely NOT give you a gain in performance.
The reason for multi-threading downloads is obvious and has dramatic results. When you try to combine the files, though, remember that you are having multiple threads manipulate a mechanical head on a conventional hard drive. (In the case of SSDs you may see better performance.)
A single thread writing SEQUENTIALLY can easily saturate the HDD's write capacity, and sequential writing is by definition the fastest way to write to conventional disks.
If you believe otherwise, I would be interested to know why. I would rather concentrate on tweaking the write performance of a single thread by playing around with buffer sizes, etc.
Yes, it is possible, but the one precaution you need to take is to ensure that no two threads write to the same location in the file; otherwise the file content will be incorrect.

FileStream writeStream = new FileStream(destinationPath, FileMode.OpenOrCreate, FileAccess.Write, FileShare.Write);
writeStream.Position = startPositionOfSegments; // REMEMBER: this offset calculation is important
// A simple write: just read from your source, then write at the reserved offset
writeStream.Write(readBuffer, 0, bytesReadFromInputStream);

After each Write we used writeStream.Flush() so that buffered data gets written to the file, but you can change that according to your requirements.
Since you already have working code which downloads the file segments in parallel, the only change you need to make is to open the file stream as posted above, and instead of creating many segment files locally, just open a stream to a single file.
startPositionOfSegments is very important; calculate it carefully so that no two segments overwrite each other's downloaded bytes at the same location in the file, otherwise the result will be incorrect.
The above procedure works perfectly fine at our end, but it can be a problem if your segment sizes are too small (we faced that too, but after increasing the size of the segments it was fixed). If you face any exception, you can also synchronize only the Write part.

Quicker file reading using multi-threading?

I wrote a script to read a 100 MB+ text file using a single thread and using multiple threads. The multi-threaded script shares the same StreamReader and locks it during the StreamReader.ReadLine() call. After timing my two scripts, they are about the same speed (it seems that ReadLine() is what takes up most of the run time).
Where can I take this next? I'm thinking of splitting the source file into multiple text files so each thread can work with its own StreamReader, but that seems a bit cumbersome. Is there a better way to speed up the process?
Thanks!
With a single hard-disk, there's not much you can do except use a single producer (to read files) multiple consumer (for processing) model. A hard disk needs to move the mechanical "head" in order to seek the next reading position. Multiple threads doing this will just bounce the head around and not bring any speedup (worse, in some cases it may be slower).
Splitting the input file is even worse, because now the file chunks are no longer consecutive and need further seeking.
So use a single thread to read chunks of the large file and either put the tasks in a synchronized queue (e.g. ConcurrentQueue) for multiple consumer threads or use QueueUserWorkItem to access the built-in thread pool.
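A sketch of that single-reader layout, using ConcurrentQueue and the built-in thread pool via QueueUserWorkItem as suggested; the chunk strings, counts, and completion flag are illustrative stand-ins for real file reads:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

class SingleReaderDemo
{
    static void Main()
    {
        var queue = new ConcurrentQueue<string>();
        bool readingDone = false;
        int processed = 0;
        var allDone = new CountdownEvent(4);

        // Consumers drain the queue from pool threads.
        for (int w = 0; w < 4; w++)
        {
            ThreadPool.QueueUserWorkItem(_ =>
            {
                while (true)
                {
                    if (queue.TryDequeue(out var chunk))
                        Interlocked.Increment(ref processed);  // stand-in for work
                    else if (Volatile.Read(ref readingDone))
                        break;                                 // queue fully drained
                    else
                        Thread.Yield();                        // wait for the reader
                }
                allDone.Signal();
            });
        }

        // The single producer: only this thread touches the disk, so the
        // head moves sequentially instead of bouncing between threads.
        for (int i = 0; i < 100; i++)
            queue.Enqueue($"chunk {i}");
        Volatile.Write(ref readingDone, true);

        allDone.Wait();
        Console.WriteLine(processed);   // 100
    }
}
```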
Where can you take this next?
Add multiple HDDs, then have one thread per HDD. Split your file across the HDDs, kinda like RAID.
EDIT:
Similar questions have been asked many times here. Just use one thread to read the file and one thread to process it. No multithreading needed.

Multithreaded file writing

I am trying to write to different pieces of a large file using multiple threads, just like a segmented file downloader would do.
My question is, what is the safe way to do this? Do I open the file for writing, create my threads, passing the Stream object to each thread? I don't want an error to occur because multiple threads are accessing the same object at potentially the same time.
This is C# by the way.
I would personally suggest that you fetch the data in multiple threads, but actually write to it from a single thread. It's likely to be considerably simpler that way. You could use a producer/consumer queue (which is really easy in .NET 4) and then each producer would feed pairs of "index, data". The consumer thread could then just sequentially seek, write, seek, write etc.
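A sketch of that producer/consumer arrangement, with the pairs carried as (offset, data) tuples and a single consumer doing all the seek/write work; the file name, offsets, and segment sizes are made up for illustration:

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

class SegmentWriter
{
    static void Main()
    {
        // Producers push (offset, data) pairs; one consumer owns the file.
        var queue = new BlockingCollection<(long Offset, byte[] Data)>();

        var writer = Task.Run(() =>
        {
            using (var fs = new FileStream("out.bin", FileMode.Create, FileAccess.Write))
            {
                foreach (var (offset, data) in queue.GetConsumingEnumerable())
                {
                    fs.Seek(offset, SeekOrigin.Begin);   // seek, write, repeat
                    fs.Write(data, 0, data.Length);
                }
            }
        });

        // Simulated downloaders: segment i covers bytes [i*4, i*4 + 4).
        Parallel.For(0, 4, i =>
            queue.Add((i * 4L, Enumerable.Repeat((byte)('A' + i), 4).ToArray())));
        queue.CompleteAdding();                          // lets the writer finish
        writer.Wait();

        Console.WriteLine(File.ReadAllText("out.bin")); // AAAABBBBCCCCDDDD
    }
}
```

Since a single thread performs every write, no synchronization around the FileStream itself is needed.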
If this were Linux programming, I would recommend you look into the pwrite() command, which writes a buffer to a file at a given offset. A cursory search of C# documentation doesn't turn up anything like this however. Does anyone know if a similar function exists?
Although one might be able to open multiple streams pointing to the same file, and use a different stream in each thread, I would second the advice of using a single thread for the writing absent some reason to do otherwise. Even if two or more threads can safely write to the same file simultaneously, that doesn't mean it's a good idea. It may be helpful to have the unified thread attempt to sequence writes in a sensible order to avoid lots of random seeking; the performance benefit from that would depend upon how effectively the OS could cache and schedule random writes. Don't go crazy optimizing such things if it turns out the OS does a good job, but be prepared to add some optimization if the OS default behavior turns out to perform poorly.

Multicore Text File Parsing

I have a quad core machine and would like to write some code to parse a text file that takes advantage of all four cores. The text file basically contains one record per line.
Multithreading isn't my forte so I'm wondering if anyone could give me some patterns that I might be able to use to parse the file in an optimal manner.
My first thoughts are to read all the lines into some sort of queue and then spin up threads to pull the lines off the queue and process them, but that means the queue would have to exist in memory and these are fairly large files so I'm not so keen on that idea.
My next thoughts are to have some sort of controller that will read in a line and assign it a thread to parse, but I'm not sure if the controller will end up being a bottleneck if the threads are processing the lines faster than it can read and assign them.
I know there's probably another simpler solution than both of these but at the moment I'm just not seeing it.
I'd go with your original idea. If you are concerned that the queue might get too large, implement a buffer zone for it (i.e. if it gets above 100 lines then stop reading the file, and if it gets below 20 then start reading again; you'd need to do some testing to find the optimal barriers). Make it so that any of the threads can potentially be the "reader thread": since a thread has to lock the queue to pull an item out anyway, it can also check whether the "low buffer region" has been hit and start reading again. While it's doing this, the other threads can work through the rest of the queue.
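The high/low watermark scheme can be approximated with a bounded BlockingCollection, where Add blocks the reader automatically once the queue is full; the capacity of 100, the worker count, and the line strings are illustrative:

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class BoundedQueueDemo
{
    static void Main()
    {
        int processed = 0;

        // boundedCapacity gives the "stop reading when full" behaviour for
        // free: Add() blocks until a consumer makes room.
        using (var lines = new BlockingCollection<string>(boundedCapacity: 100))
        {
            var reader = Task.Run(() =>
            {
                for (int i = 0; i < 1000; i++)
                    lines.Add($"line {i}");      // blocks while the queue is full
                lines.CompleteAdding();
            });

            var workers = Enumerable.Range(0, 4).Select(_ => Task.Run(() =>
            {
                foreach (var line in lines.GetConsumingEnumerable())
                    Interlocked.Increment(ref processed);   // stand-in for parsing
            })).ToArray();

            reader.Wait();
            Task.WaitAll(workers);
        }
        Console.WriteLine(processed);            // 1000
    }
}
```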
Or if you prefer, have one reader thread assign the lines to three other processor threads (via their own queues) and implement a work-stealing strategy. I've never done this so I don't know how hard it is.
Mark's answer is the simpler, more elegant solution. Why build a complex program with inter-thread communication if it's not necessary? Spawn 4 threads. Each thread calculates size-of-file/4 to determine its start point (and stop point). Each thread can then work entirely independently.
The only reason to add a special thread to handle reading is if you expect some lines to take a very long time to process and you expect that these lines are clustered in a single part of the file. Adding inter-thread communication when you don't need it is a very bad idea. You greatly increase the chance of introducing an unexpected bottleneck and/or synchronization bugs.
This will eliminate bottlenecks of having a single thread do the reading:
open file
for each thread n=0,1,2,3:
seek to file offset n/4 * filesize
scan to next complete line
process all lines in your part of the file
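A sketch of that outline in C#, with one simplification: each thread here only counts newlines in its byte range, so there is no need to scan forward to the next complete line; for real per-line processing you would add that scan as the pseudocode describes. The file name and generated contents are illustrative:

```csharp
using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

class ChunkedLineCounter
{
    // Count '\n' bytes in the half-open range [start, end) of the file.
    static long CountNewlines(string path, long start, long end)
    {
        long count = 0;
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read))
        {
            fs.Seek(start, SeekOrigin.Begin);
            for (long pos = start; pos < end; pos++)
            {
                int b = fs.ReadByte();
                if (b == -1) break;
                if (b == '\n') count++;
            }
        }
        return count;
    }

    static void Main()
    {
        const string path = "big.txt";
        File.WriteAllLines(path, Enumerable.Range(0, 1000).Select(i => $"record {i}"));

        long size = new FileInfo(path).Length;
        var counts = new long[4];
        // Thread n takes the byte range [n/4 * size, (n+1)/4 * size).
        Parallel.For(0, 4, n =>
            counts[n] = CountNewlines(path, n * size / 4, (n + 1) * size / 4));

        Console.WriteLine(counts.Sum());   // 1000
    }
}
```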
My experience is with Java, not C#, so apologies if these solutions don't apply.
The immediate solution I can think up off the top of my head would be to have an executor that runs 3 threads (using Executors.newFixedThreadPool, say). For each line/record read from the input file, fire off a job at the executor (using ExecutorService.submit). The executor will queue requests for you, and allocate between the 3 threads.
Probably better solutions exist, but hopefully that will do the job. :-)
ETA: Sounds a lot like Wolfbyte's second solution. :-)
ETA2: System.Threading.ThreadPool sounds like a very similar idea in .NET. I've never used it, but it may be worth your while!
Since the bottleneck will generally be in the processing and not the reading when dealing with files, I'd go with the producer-consumer pattern. To avoid locking, I'd look at lock-free lists. Since you are using C#, you can take a look at Julian Bucknall's lock-free list code.
#lomaxx
#Derek & Mark: I wish there was a way to accept 2 answers. I'm going to end up going with Wolfbyte's solution, because if I split the file into n sections there is the potential for a thread to come across a batch of "slow" transactions. However, if I were processing a file where each record was guaranteed to require an equal amount of processing, then I really like your solution of just splitting the file into chunks, assigning each chunk to a thread, and being done with it.
No worries. If clustered "slow" transactions are an issue, then the queuing solution is the way to go. Depending on how fast or slow the average transaction is, you might also want to look at assigning multiple lines at a time to each worker. This will cut down on synchronization overhead. Likewise, you might need to optimize your buffer size. Of course, both of these are optimizations that you should probably only do after profiling. (No point in worrying about synchronization if it's not a bottleneck.)
If the text that you are parsing is made up of repeated strings and tokens, break the file into chunks and for each chunk you could have one thread pre-parse it into tokens consisting of keywords, "punctuation", ID strings, and values. String compares and lookups can be quite expensive and passing this off to several worker threads can speed up the purely logical / semantic part of the code if it doesn't have to do the string lookups and comparisons.
The pre-parsed data chunks (where you have already done all the string comparisons and "tokenized" it) can then be passed to the part of the code that would actually look at the semantics and ordering of the tokenized data.
Also, you mention you are concerned with the size of your file occupying a large amount of memory. There are a couple things you could do to cut back on your memory budget.
Split the file into chunks and parse it. Read in only as many chunks as you are working on at a time plus a few for "read ahead" so you do not stall on disk when you finish processing a chunk before you go to the next chunk.
Alternatively, large files can be memory-mapped and loaded on demand. If you have more threads working on processing the file than CPUs (usually threads = 1.5-2x CPUs is a good number for demand-paging apps), the threads that stall on I/O for the memory-mapped file will be blocked automatically by the OS until their memory is ready, while the other threads continue to process.
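A sketch of that demand-paged approach using MemoryMappedFile; the file name, 4 KB size, chunk layout, and byte-summing work are all illustrative:

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class MappedSum
{
    static void Main()
    {
        const string path = "data.bin";
        File.WriteAllBytes(path, Enumerable.Repeat((byte)1, 4096).ToArray());

        // Map the file once; the OS faults pages in on demand as each worker
        // touches its chunk, blocking only the thread waiting on that I/O.
        using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        {
            long sum = 0;
            Parallel.For(0, 4, chunk =>
            {
                using (var view = mmf.CreateViewAccessor(chunk * 1024, 1024,
                                                         MemoryMappedFileAccess.Read))
                {
                    long local = 0;
                    for (int i = 0; i < 1024; i++)
                        local += view.ReadByte(i);   // offset is relative to the view
                    Interlocked.Add(ref sum, local);
                }
            });
            Console.WriteLine(sum);   // 4096
        }
    }
}
```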