Quicker file reading using multi-threading? - c#

I wrote a script to read a 100 MB+ text file using a single thread and using multiple threads. The multi-threaded script shares one StreamReader across threads and locks it around the StreamReader.ReadLine() call. After timing the two scripts, they are about the same speed (it seems that ReadLine() is what takes up most of the run-time).
Where can I take this next? I'm thinking of splitting the source file into multiple text files so each thread can work with its own StreamReader, but that seems cumbersome. Is there a better way to speed up my process?
Thanks!

With a single hard disk, there's not much you can do except use a single-producer (to read the file), multiple-consumer (for processing) model. A hard disk needs to move its mechanical head to seek to the next read position; multiple threads reading at once will just bounce the head around and bring no speedup (worse, in some cases it may be slower).
Splitting the input file is even worse, because the file chunks are then no longer consecutive on disk and need further seeking.
So use a single thread to read chunks of the large file, and either put the work items in a synchronized queue (e.g. ConcurrentQueue<T>) for multiple consumer threads, or use ThreadPool.QueueUserWorkItem to hand them to the built-in thread pool.
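A minimal sketch of that layout, using BlockingCollection&lt;string&gt; (which wraps ConcurrentQueue&lt;T&gt;). The file path and demo content here are placeholders for the real 100 MB+ input, and the per-line work is stubbed out with a counter:

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

class SingleReaderMultiConsumer
{
    public static int Processed;

    public static void Main()
    {
        // Small demo input so the sketch runs end-to-end; in practice this
        // would be the large source file.
        string path = Path.Combine(Path.GetTempPath(), "demo-lines.txt");
        File.WriteAllLines(path, new[] { "alpha", "beta", "gamma" });

        Processed = 0;

        // Bounded capacity keeps memory use flat if consumers fall behind.
        var lines = new BlockingCollection<string>(boundedCapacity: 10_000);

        // Single producer: the only thread that touches the disk.
        var reader = Task.Run(() =>
        {
            foreach (var line in File.ReadLines(path))
                lines.Add(line);               // blocks if the queue is full
            lines.CompleteAdding();
        });

        // Multiple consumers: CPU-bound per-line work in parallel.
        var consumers = new Task[Environment.ProcessorCount];
        for (int i = 0; i < consumers.Length; i++)
        {
            consumers[i] = Task.Run(() =>
            {
                foreach (var line in lines.GetConsumingEnumerable())
                {
                    // ... parse 'line' here ...
                    Interlocked.Increment(ref Processed);
                }
            });
        }

        reader.Wait();
        Task.WaitAll(consumers);
        Console.WriteLine(Processed + " lines processed");
    }
}
```

GetConsumingEnumerable lets each consumer block until a line is available or the producer calls CompleteAdding, so no manual locking is needed.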

Where can you take this next?
Add multiple HDDs then have 1 thread per HDD. Split your file across the HDDs. Kinda like RAID.
EDIT:
Similar questions have been asked many times here. Just use one thread to read the file and one thread to process. No further multithreading needed.

Related

Parallel IO read on a single file

I'm trying to read a pretty large file (around 4 GB) and parse it. I use a producer/consumer pattern: I read the file line by line and offload each line for processing. Is it possible to parallelize the reading of the single file itself, by starting multiple IO threads that read the file in chunks and pass them to the consumers? I've searched around and the advice was that it's better to use one thread for IO and then parallelize the processing; can someone elaborate on this, please?

best way to handle multiple streams to same file

I am trying to create a binary storage. Files, once added, can't be deleted. All files are stored in the same master file (storage.bin), and I will keep an index table of (key, startFilePos, fileLength) pairs.
I am struggling to find the best solution before I start coding. As I am not that familiar with streams under .NET, I would like some advice on them.
1. Are multiple streams to the same file (storage.bin) a solution? As files may be added in parallel, once I start receiving a file, I reserve the space on disk for it and start writing that specific file to the reserved block while other same-scenario files are added on other threads.
2. If (1) is a viable solution, is it the optimal one? If not, could you point to a better one?
3. What is the best way to allocate threads for every file-add scenario? It seems like one thread per stream isn't a viable solution. What I was thinking is to have a thread pool and, if the limit is reached, have a new file wait for one to become available.
Are multiple streams to the same file (storage.bin) a solution? As files may be added in parallel, once I start receiving a file, I reserve the space on disk for it and start writing that specific file to the reserved block while other same-scenario files are added on other threads.
Yes, this could be a solution. As long as the different streams are writing to different non-overlapping segments of the file there will be no problems.
if (1) is a viable solution, is it the optimal one? if not, could you point to a better one?
Use separate files and keep the index in the master file.
What is the best way to allocate threads for every file-add scenario? It seems like one thread per stream isn't a viable solution. What I was thinking is to have a thread pool and, if the limit is reached, have a new file wait for one to become available.
You've just described how a thread pool works. Before rolling your own, make sure you check the ThreadPool class built directly into the framework, or a custom concurrency-limiting task scheduler.
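One lightweight way to get that "wait until a slot is free" behavior without rolling your own pool is SemaphoreSlim as the concurrency cap. This is only a sketch: the class and method names are made up, and the actual reserve-and-write step into storage.bin is stubbed with a delay:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class FileAddThrottle
{
    // At most four file-add operations run at once; later calls queue up.
    static readonly SemaphoreSlim Gate = new SemaphoreSlim(4);
    public static int Completed;

    public static async Task AddFileAsync(string key, byte[] data)
    {
        await Gate.WaitAsync();        // wait here until a slot frees up
        try
        {
            // Reserve a region in storage.bin and write 'data' there
            // (elided); the delay stands in for the actual disk write.
            await Task.Delay(10);
            Interlocked.Increment(ref Completed);
        }
        finally
        {
            Gate.Release();
        }
    }

    public static async Task Main()
    {
        Completed = 0;
        var adds = new Task[10];
        for (int i = 0; i < adds.Length; i++)
            adds[i] = AddFileAsync($"file-{i}", new byte[16]);
        await Task.WhenAll(adds);
        Console.WriteLine(Completed + " adds finished");
    }
}
```

Because the semaphore is awaited rather than blocked on, no thread is tied up while a file-add waits for a slot.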

use Task parallel library for I/O bound processing

Wondering if you could clarify.
I am writing a tool; all it has to do is retrieve data from a database (SQL Server) and create txt files.
I am talking about 500,000 txt files.
It's working and all is good.
However, I was wondering if using the Task Parallel Library could speed up the time it takes to create these files.
I know (have read) that the TPL is not meant for I/O-bound processing and that it will most likely perform the same as the sequential version.
Is this true?
Also, in an initial attempt using a simple Parallel.ForEach, I was getting an error: cannot access the file because it is in use.
Any advice?
You do not parallelize I/O-bound processes.
The reason is simple: the CPU is not the bottleneck. No matter how many threads you start, you only have ONE disk to write to, and that is the slowest part.
So what you need to do is simply iterate over every file and write it. You can start a separate worker thread to do this, or use async I/O to get better UI responsiveness.
If you read and/or write from multiple disks, then parallelizing could improve speed. E.g. if you want to read all your files, run a hash on each, and store the hashes, you could create one thread per disk and see a significant speed-up. In your case, however, tasks are unlikely to improve performance.
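A rough sketch of the "one writer, async I/O" suggestion; the directory, file names, and row content are made up, and five files stand in for the 500,000:

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

class ExportFiles
{
    public static async Task Main()
    {
        string dir = Path.Combine(Path.GetTempPath(), "export-demo");
        Directory.CreateDirectory(dir);

        // One writer, one disk: just iterate. Awaiting each write keeps the
        // calling thread (e.g. a UI thread) responsive while the disk works.
        for (int i = 0; i < 5; i++)               // 500,000 in the real tool
        {
            string path = Path.Combine(dir, $"record-{i}.txt");
            await File.WriteAllTextAsync(path, $"row {i} from the database");
        }

        Console.WriteLine(Directory.GetFiles(dir).Length + " files written");
    }
}
```

Note that each record gets a unique file name. The "file in use" error from Parallel.ForEach usually means two iterations opened the same path at the same time.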

Multithreaded file writing

I am trying to write to different pieces of a large file using multiple threads, just like a segmented file downloader would do.
My question is, what is the safe way to do this? Do I open the file for writing, create my threads, passing the Stream object to each thread? I don't want an error to occur because multiple threads are accessing the same object at potentially the same time.
This is C# by the way.
I would personally suggest that you fetch the data in multiple threads, but actually write to it from a single thread. It's likely to be considerably simpler that way. You could use a producer/consumer queue (which is really easy in .NET 4) and then each producer would feed pairs of "index, data". The consumer thread could then just sequentially seek, write, seek, write etc.
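A small sketch of that producer/consumer arrangement: the "downloaded" bytes are faked, and only the single consumer thread ever touches the FileStream, so no locking is needed around Seek/Write:

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

class SegmentedWriter
{
    public static void Main()
    {
        string path = Path.Combine(Path.GetTempPath(), "segments.bin");
        var chunks = new BlockingCollection<(long Offset, byte[] Data)>();

        // Producers: several "downloaders" hand off (offset, data) pairs.
        var producers = new Task[3];
        for (int i = 0; i < producers.Length; i++)
        {
            int n = i;
            producers[i] = Task.Run(() =>
            {
                var data = new byte[100];       // stand-in for fetched bytes
                Array.Fill(data, (byte)n);
                chunks.Add((n * 100L, data));
            });
        }
        Task.WaitAll(producers);
        chunks.CompleteAdding();

        // Single consumer: the only thread that ever touches the stream.
        using (var fs = new FileStream(path, FileMode.Create, FileAccess.Write))
        {
            foreach (var (offset, data) in chunks.GetConsumingEnumerable())
            {
                fs.Seek(offset, SeekOrigin.Begin);
                fs.Write(data, 0, data.Length);
            }
        }
        Console.WriteLine(new FileInfo(path).Length);  // 300
    }
}
```

For what it's worth, .NET 6 later added RandomAccess.Write, which takes an explicit file offset much like POSIX pwrite(), for cases where positioned writes from several threads are genuinely wanted.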
If this were Linux programming, I would recommend you look into the pwrite() command, which writes a buffer to a file at a given offset. A cursory search of C# documentation doesn't turn up anything like this however. Does anyone know if a similar function exists?
Although one might be able to open multiple streams pointing to the same file, and use a different stream in each thread, I would second the advice of using a single thread for the writing absent some reason to do otherwise. Even if two or more threads can safely write to the same file simultaneously, that doesn't mean it's a good idea. It may be helpful to have the unified thread attempt to sequence writes in a sensible order to avoid lots of random seeking; the performance benefit from that would depend upon how effectively the OS could cache and schedule random writes. Don't go crazy optimizing such things if it turns out the OS does a good job, but be prepared to add some optimization if the OS default behavior turns out to perform poorly.

Multicore Text File Parsing

I have a quad core machine and would like to write some code to parse a text file that takes advantage of all four cores. The text file basically contains one record per line.
Multithreading isn't my forte so I'm wondering if anyone could give me some patterns that I might be able to use to parse the file in an optimal manner.
My first thoughts are to read all the lines into some sort of queue and then spin up threads to pull the lines off the queue and process them, but that means the queue would have to exist in memory and these are fairly large files so I'm not so keen on that idea.
My next thoughts are to have some sort of controller that will read in a line and assign it a thread to parse, but I'm not sure if the controller will end up being a bottleneck if the threads are processing the lines faster than it can read and assign them.
I know there's probably another simpler solution than both of these but at the moment I'm just not seeing it.
I'd go with your original idea. If you are concerned that the queue might get too large, implement a buffer zone for it (i.e. if it gets above 100 lines, stop reading the file, and if it drops below 20, start reading again; you'd need to do some testing to find the optimal barriers). Make it so that any of the threads can potentially be the "reader thread": since a thread has to lock the queue to pull an item out anyway, it can also check whether the "low buffer" mark has been hit and start reading again. While it's doing this, the other threads can work through the rest of the queue.
Or if you prefer, have one reader thread assign the lines to three other processor threads (via their own queues) and implement a work-stealing strategy. I've never done this so I don't know how hard it is.
Mark's answer is the simpler, more elegant solution. Why build a complex program with inter-thread communication if it's not necessary? Spawn 4 threads. Each thread calculates size-of-file/4 to determine its start point (and stop point). Each thread can then work entirely independently.
The only reason to add a special thread to handle reading is if you expect some lines to take a very long time to process and you expect that these lines are clustered in a single part of the file. Adding inter-thread communication when you don't need it is a very bad idea. You greatly increase the chance of introducing an unexpected bottleneck and/or synchronization bugs.
This will eliminate the bottleneck of having a single thread do the reading:

    open file
    for each thread n = 0, 1, 2, 3:
        seek to file offset (n/4) * filesize
        scan forward to the start of the next complete line
        process all lines in your part of the file
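The pseudocode above might look like this in C#. The sample file and record lines are made up, and the byte-by-byte reading is deliberately simple (a real parser would read buffered blocks); the fiddly part is the boundary convention, here "a line belongs to the chunk its first byte falls in":

```csharp
using System;
using System.IO;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

class ChunkedParser
{
    public static int Lines;

    public static void Main()
    {
        string path = Path.Combine(Path.GetTempPath(), "records.txt");
        File.WriteAllLines(path, new[] { "r1", "r2", "r3", "r4", "r5", "r6", "r7", "r8" });

        Lines = 0;
        long size = new FileInfo(path).Length;
        const int parts = 4;
        var tasks = new Task[parts];
        for (int n = 0; n < parts; n++)
        {
            long start = n * size / parts;
            long end = (n + 1) * size / parts;
            tasks[n] = Task.Run(() => ParsePart(path, start, end));
        }
        Task.WaitAll(tasks);
        Console.WriteLine(Lines);  // 8
    }

    static void ParsePart(string path, long start, long end)
    {
        using var fs = File.OpenRead(path);    // each thread owns its stream
        long pos = start;
        if (start != 0)
        {
            // Back up one byte: if it's '\n', a line begins exactly at
            // 'start'; otherwise skip the tail of a line that started in
            // the previous chunk (that chunk's worker finishes it).
            fs.Seek(start - 1, SeekOrigin.Begin);
            pos = start - 1;
            int s;
            while ((s = fs.ReadByte()) != -1)
            {
                pos++;
                if (s == '\n') break;
            }
        }

        var sb = new StringBuilder();
        while (true)
        {
            if (pos >= end && sb.Length == 0) break;  // chunk done, no open line
            int b = fs.ReadByte();
            if (b == -1)                              // EOF: flush a last line
            {
                if (sb.Length > 0) CountLine(sb.ToString());
                break;
            }
            pos++;
            if (b == '\n')
            {
                CountLine(sb.ToString().TrimEnd('\r'));
                sb.Clear();
            }
            else sb.Append((char)b);
        }
    }

    static void CountLine(string line)
    {
        // ... parse 'line' here ...
        Interlocked.Increment(ref Lines);
    }
}
```

Each worker reads past its end offset to finish the line it started, and skips any partial line at its start, so every line is processed exactly once.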
My experience is with Java, not C#, so apologies if these solutions don't apply.
The immediate solution I can think up off the top of my head would be to have an executor that runs 3 threads (using Executors.newFixedThreadPool, say). For each line/record read from the input file, fire off a job at the executor (using ExecutorService.submit). The executor will queue requests for you, and allocate between the 3 threads.
Probably better solutions exist, but hopefully that will do the job. :-)
ETA: Sounds a lot like Wolfbyte's second solution. :-)
ETA2: System.Threading.ThreadPool sounds like a very similar idea in .NET. I've never used it, but it may be worth your while!
Since the bottleneck will generally be in the processing and not the reading when dealing with files, I'd go with the producer-consumer pattern. To avoid locking I'd look at lock-free lists. Since you are using C# you can take a look at Julian Bucknall's Lock-Free List code.
#lomaxx
#Derek & Mark: I wish there was a way to accept two answers. I'm going to have to go with Wolfbyte's solution, because if I split the file into n sections there is the potential for a thread to hit a batch of "slow" transactions. But if I were processing a file where each record was guaranteed to require an equal amount of processing, I'd really like your solution of just splitting the file into chunks, assigning each chunk to a thread, and being done with it.
No worries. If clustered "slow" transactions are an issue, then the queuing solution is the way to go. Depending on how fast or slow the average transaction is, you might also want to look at assigning multiple lines at a time to each worker; this will cut down on synchronization overhead. Likewise, you might need to tune your buffer size. Of course, both of these are optimizations that you should probably only do after profiling. (No point in worrying about synchronization if it's not a bottleneck.)
If the text that you are parsing is made up of repeated strings and tokens, break the file into chunks and for each chunk you could have one thread pre-parse it into tokens consisting of keywords, "punctuation", ID strings, and values. String compares and lookups can be quite expensive and passing this off to several worker threads can speed up the purely logical / semantic part of the code if it doesn't have to do the string lookups and comparisons.
The pre-parsed data chunks (where you have already done all the string comparisons and "tokenized" it) can then be passed to the part of the code that would actually look at the semantics and ordering of the tokenized data.
Also, you mention you are concerned with the size of your file occupying a large amount of memory. There are a couple things you could do to cut back on your memory budget.
Split the file into chunks and parse it. Read in only as many chunks as you are working on at a time plus a few for "read ahead" so you do not stall on disk when you finish processing a chunk before you go to the next chunk.
Alternatively, large files can be memory-mapped and loaded on demand. If you have more threads working on processing the file than CPUs (usually threads = 1.5-2x the CPU count is a good number for demand-paging apps), threads that stall on IO for the memory-mapped file will be blocked automatically by the OS until their memory is ready, and the other threads will continue processing.
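A minimal sketch of the memory-mapped option using MemoryMappedFile; the demo file and its contents are placeholders:

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

class MappedScan
{
    public static void Main()
    {
        string path = Path.Combine(Path.GetTempPath(), "mapped-demo.txt");
        File.WriteAllText(path, "one\ntwo\nthree\n");
        long size = new FileInfo(path).Length;

        // Map the file; the OS pages bytes in on demand as they are
        // touched, so nothing is read until an accessor asks for it.
        using var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open);
        using var view = mmf.CreateViewAccessor(0, size, MemoryMappedFileAccess.Read);

        int newlines = 0;
        for (long i = 0; i < size; i++)
            if (view.ReadByte(i) == (byte)'\n') newlines++;

        Console.WriteLine(newlines);  // 3
    }
}
```

Several threads can read from the same view (or their own views over different ranges) concurrently; a thread that touches a page not yet in memory simply blocks in the page fault while the others keep running.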