Will using multiple threads speed up my HTML file processing application?

I just finished up my most complex and feature-laden WinForms application to date. It loads a list of any number of HTML files, then loads the content of one, uses some RegEx to match some tags and remove or replace them (yes, yes, I've seen this. It works just fine, thanks Cthulhu), then writes it to disk.
However, I noticed that ~200 files takes roughly 30 seconds to process, and after the first 5-10 seconds the program is reported as "Not Responding". I'm assuming it's not wise to do something like this guy did, as the hard drive is a bottleneck.
Perhaps it'd be possible to load as many as possible into memory, then process each one with a thread, write those, then load some more into memory?
At the very least, would creating a worker thread separate from the UI thread prevent the "Not Responding" issue? (This MSDN article covers what I was considering.)
I guess I'm asking if multithreading will offer any sort of speed improvement, and if so, what would be the best way of going about it?
Any help or advice is much appreciated!

Yes, you should start by using a BackgroundWorker to decouple your work from the GUI. Handling a GUI event should never take too much time. Aim for 20ms, not 20s.
Then as a bonus you could see if the processing (CPU intensive part) can be split into independent jobs and execute them as TPL Tasks.
There is insufficient information to say if or how you should do that.
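As a rough sketch of those two steps, assuming .NET 4.5 or later: the `HtmlCleaner` class and its tag pattern below are hypothetical (the asker's actual regex is not shown), and the code only illustrates the shape of the approach, not a definitive implementation. The CPU-bound cleaning is split into independent per-document jobs with `Parallel.For`, and `Task.Run` moves the whole batch off the calling thread.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

public static class HtmlCleaner
{
    // Hypothetical pattern: strips <font ...> and </font> tags.
    // This is an assumption for illustration, not the asker's regex.
    private static readonly Regex FontTag =
        new Regex(@"</?font\b[^>]*>", RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // CPU-bound step for a single document.
    public static string Clean(string html)
    {
        return FontTag.Replace(html, string.Empty);
    }

    // Cleans every document in parallel. Results keep the input order
    // because each iteration writes only to its own slot of the array.
    public static string[] CleanAll(IList<string> documents)
    {
        var results = new string[documents.Count];
        Parallel.For(0, documents.Count, i => results[i] = Clean(documents[i]));
        return results;
    }

    // Runs the whole batch on a thread-pool thread, so a caller such as
    // a button click handler can await it without blocking the UI thread.
    public static Task<string[]> CleanAllAsync(IList<string> documents)
    {
        return Task.Run(() => CleanAll(documents));
    }
}
```

In a WinForms handler this would typically be used as `var cleaned = await HtmlCleaner.CleanAllAsync(docs);` inside an `async void` click handler; whether the parallel part helps at all depends on how much of the 30 seconds is CPU time rather than disk time.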

Threading jobs, tasks, etc. will, in most cases, prevent the primary, or main thread from becoming non-responsive. Do not create multiple threads for disk IO (obviously). I would dedicate a single worker thread to taking your files off a queue and processing the disk IO. Otherwise, 1 or 2 worker threads to do in-memory processing should be sufficient while your main thread can remain responsive.

First of all, if you want the program to remain responsive move the calculations to a separate thread (remove it from the UI thread).
The actual performance improvement depends on the number of processors you have, not the number of threads.
So if you have P processors, you can divide the work into P work items and get some improvement, limited by the part of the work that cannot be parallelized (Amdahl's Law).
You can use BackgroundWorker to run the work in the background: C# BackgroundWorker Tutorial
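For reference, Amdahl's Law mentioned above can be written as follows, where f is the fraction of the work that can run in parallel and P is the number of processors. The numbers in the worked example are purely illustrative, not measurements from the asker's application.

```latex
S(P) = \frac{1}{(1 - f) + \dfrac{f}{P}}
\qquad \text{e.g. } f = 0.8,\ P = 4:\quad
S(4) = \frac{1}{0.2 + 0.2} = 2.5,
\qquad \lim_{P \to \infty} S(P) = \frac{1}{1 - f} = 5
```

So if 20% of the run is serial (for example, waiting on a single disk), no number of cores can make it more than 5 times faster.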

Why not use File.ReadAllLines() to read each file into an array, and then process each element of the array?

If you do all your processing in the GUI-thread, your application will show the 'not responding' if it takes very long. In my opinion, you should try to never do (extensive) processing actions in the same thread as your GUI.
In addition, you could even just create a thread for each file to be processed. This will most likely speed things up, as long as the separate threads do not need any data from each other.

Related

What is a multithreading program and how does it work?

What is a multithreading program and how does it work exactly? I read some documents but I'm confused. I know that code is executed line by line, but I can't understand how the program manages this.
A simple answer would be appreciated. A C# example, please (only animation!)

What is a multi-threading program and how does it work exactly?
The interesting part about this question is that complete books have been written on the topic, yet it is still elusive to a lot of people. I will try to explain in the order detailed underneath.
Please note this is just to provide a gist; an answer like this can never do justice to the depth and detail required. Regarding videos, the best that I have come across are part of paid subscriptions (Wintellect and Pluralsight). Check whether you can watch them on a trial basis, assuming you don't already have a subscription:
Wintellect by Jeffrey Richter (from his book, CLR via C#, which has the same chapter on Thread Fundamentals)
CLR Threading by Mike Woodring
Explanation Order
What is a thread ?
Why were threads introduced, main purpose ?
Pitfalls and how to avoid them, using Synchronization constructs ?
Thread Vs ThreadPool ?
Evolution of Multi threaded programming API, like Parallel API, Task API
Concurrent Collections, usage ?
Async-Await, thread but no thread, why they are best for IO
What is a thread ?
A thread is a software construct provided by the operating system (this answer looks at it from the Windows side, which has a multi-threaded architecture); it is the bare minimum unit of work. Every process on Windows has at least one thread, and every method call is made on a thread. Each process can have multiple threads to do multiple things in parallel (provided the hardware supports it).
Unix-based operating systems have traditionally leaned more on a multi-process architecture, whereas on Windows even the most complex pieces of software, like Oracle.exe, are a single process with multiple threads for different critical background operations.
Why were threads introduced, main purpose ?
Contrary to the perception that concurrency is the main purpose, it was robustness that led to the introduction of threads. Imagine every process on Windows running on the same thread (as in the initial 16-bit version): if one of those processes crashed, in most cases that simply meant a system restart to recover. The use of threads for concurrent operations, since several of them can be started in each process, came into the picture later. In fact, they are now important for utilizing a multi-core processor to its full ability.
Pitfalls and how to avoid them, using Synchronization constructs ?
More threads means more work completed concurrently, but issues arise when the same memory is accessed, especially for writes, as that is when it can lead to:
Memory corruption
Race condition
Another issue is that a thread is a very costly resource: each thread has a thread environment block and a kernel memory allocation. Also, scheduling each thread on a processor core costs time for context switching. It is quite possible that misuse causes a huge performance penalty instead of an improvement.
To avoid thread-related corruption issues, it's important to use synchronization constructs such as lock, mutex and semaphore, based on the requirement. Concurrent reads of data that nobody is modifying are thread safe, but writes need appropriate synchronization.
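To make the race condition concrete, here is a small sketch, assuming .NET 4.0 or later. The `CounterDemo` class is a made-up name for illustration. Both methods increment a shared counter from many threads; only the second one protects the increment with `lock`.

```csharp
using System.Threading.Tasks;

public static class CounterDemo
{
    // Unsynchronized: count++ is a read-modify-write, so concurrent
    // increments can overwrite each other and the total may come up short.
    public static int CountUnsafe(int iterations)
    {
        int count = 0;
        Parallel.For(0, iterations, i => { count++; });
        return count;
    }

    // Synchronized: the lock lets only one thread at a time run the increment,
    // so every increment is counted.
    public static int CountWithLock(int iterations)
    {
        int count = 0;
        object gate = new object();
        Parallel.For(0, iterations, i =>
        {
            lock (gate) { count++; }
        });
        return count;
    }
}
```

`CountWithLock(n)` always returns `n`, while `CountUnsafe(n)` may return less on a multi-core machine. For a plain counter, `Interlocked.Increment` would likely be a cheaper choice than a lock; the lock is shown because it generalizes to larger critical sections.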
Thread Vs ThreadPool ?
The threads we use in C#/.NET are not the real threads; they are just managed wrappers that invoke Win32 threads. The challenge lies in the user's ability to grossly misuse them, such as creating far more threads than required or assigning processor affinity. So isn't it better to request a standard pool to queue the work item, and let Windows decide when a new thread is required and when an already existing thread can run the work item? A thread is a costly resource whose usage needs to be optimized, otherwise it can be a bane rather than a boon.
Evolution of Multi threaded programming API, like Parallel API, Task API
From .NET 4.0 onward, a variety of new APIs, Parallel.For and Parallel.ForEach for data parallelization and the Task API for task parallelization, have made it very simple to introduce concurrency into a system. These APIs again work using a thread pool internally. A Task is more like scheduling a piece of work for some time in the future. Introducing concurrency is now a breeze, though synchronization constructs are still required to avoid memory corruption and race conditions; alternatively, thread-safe collections can be used.
Concurrent Collections, usage ?
Implementations like ConcurrentBag, ConcurrentQueue and ConcurrentDictionary, part of System.Collections.Concurrent, are inherently thread safe, using spin-waits internally, and are much easier and quicker to use than explicit synchronization. They are also much easier to manage and work with. There is another set of APIs, such as ImmutableList in System.Collections.Immutable, available via NuGet, which are thread safe by virtue of creating another copy of the data structure internally.
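As a small illustration of a concurrent collection replacing explicit locking, here is a sketch assuming .NET 4.0 or later. `TagCounter` is a hypothetical helper, not something from the question; it counts occurrences of strings from several threads at once using `ConcurrentDictionary.AddOrUpdate`.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class TagCounter
{
    // Counts how often each tag name occurs, with the counting spread
    // over several threads. No explicit lock is needed: AddOrUpdate
    // retries internally until its update has been applied.
    public static IDictionary<string, int> Count(IEnumerable<string> tags)
    {
        var counts = new ConcurrentDictionary<string, int>();
        Parallel.ForEach(tags, tag =>
            counts.AddOrUpdate(tag, 1, (key, current) => current + 1));
        return counts;
    }
}
```

Note that the update delegate passed to `AddOrUpdate` can run more than once under contention, so it should stay free of side effects, as it is here.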
Async-Await, thread but no thread, why they are best for IO
This is an important aspect of concurrency meant for IO calls (disk, network). The other APIs discussed so far are meant for compute-based concurrency, where threads are important and make things faster. For IO calls, a thread has no use except waiting for the call to return; on Windows, IO calls are processed on hardware-based queues via IO Completion Ports.
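A minimal sketch of the async-await point, assuming .NET 4.5 or later. The `AsyncFileReader` class name is an assumption for illustration. The file is opened with `useAsync: true` so that the read is issued as asynchronous I/O, and the awaiting code does not hold a thread while the read is in flight.

```csharp
using System.IO;
using System.Threading.Tasks;

public static class AsyncFileReader
{
    // Reads a whole text file without blocking the calling thread
    // while the operating system performs the read.
    public static async Task<string> ReadAllTextAsync(string path)
    {
        // useAsync: true requests asynchronous (overlapped) I/O from the OS.
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read,
                                           FileShare.Read, 4096, useAsync: true))
        using (var reader = new StreamReader(stream))
        {
            return await reader.ReadToEndAsync();
        }
    }
}
```

From a UI event handler this would be called as `string html = await AsyncFileReader.ReadAllTextAsync(path);`, which keeps the UI responsive without creating a dedicated thread for the wait.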

A simple analogy might be found in the kitchen.
You've probably cooked using a recipe before -- start with the specified ingredients, follow the steps indicated in the recipe, and at the end you (hopefully) have a delicious dish ready to eat. If you do that, then you have executed a traditional (non-multithreaded) program.
But what if you have to cook a full meal, which includes a number of different dishes? The simple way to do it would be to start with the first recipe, do everything the recipe says, and when it's done, put the finished dish (and the first recipe) aside, then start on the second recipe, do everything it says, put the second dish (and second recipe) aside, and so on until you've gone through all of the recipes one after another. That will work, but you might end up spending 10 hours in the kitchen, and of course by the time the last dish is ready to eat, the first dish might be cold and unappetizing.
So instead you'd probably do what most chefs do, which is to start working on several recipes at the same time. For example, you might put the roast in the oven for 45 minutes, but instead of sitting in front of the oven waiting 45 minutes for the roast to cook, you'd spend the 45 minutes chopping the vegetables. When the oven timer rings, you put down your vegetable knife, pull the cooked roast out of the oven and let it cool, then go back to chopping vegetables, and so on. If you can do that, then you are successfully multitasking several recipes/programs. That is, you aren't literally working on multiple recipes at once (you still have only two hands!), but you are jumping back and forth from following one recipe to following another whenever necessary, and thereby making progress on several tasks rather than twiddling your thumbs a lot. Do this well and you can have the whole meal ready to eat in a much shorter amount of time, and everything will be hot and fresh at about the same time too. If you do this, you are executing a simple multithreaded program.
Then if you wanted to get really fancy, you might hire a few other chefs to work in the kitchen at the same time as you, so that you can get even more food prepared in a given amount of time. If you do this, your team is doing multiprocessing, with each chef taking one part of the total work and all of them working simultaneously. Note that each chef may well be working on multiple recipes (i.e. multitasking) as described in the previous paragraph.
As for how a computer does this sort of thing (no more analogies about chefs), it usually implements it using a list of ready-to-run threads and a timer. When the timer goes off (or when the thread that is currently executing has nothing to do for a while, because e.g. it is waiting to load data from a slow hard drive or something), the operating system does a context switch, in which it pauses the current thread (by putting it into a list somewhere and no longer executing instructions from that thread's code), then pulls another ready-to-run thread from the list of ready-to-run threads and starts executing instructions from that thread's code instead. This repeats for as long as necessary, often with context switches happening every few milliseconds, giving the illusion that multiple programs are running "at the same time" even on a single-core CPU. (On a multi-core CPU it does this same thing on each core, and in that case it's no longer just an illusion; multiple programs really are running at the same time.)

Why don't you refer to Microsoft's very own documentation of the .net class System.Threading.Thread?
It has a handful of simple example programs written in C# (at the bottom of the page), just as you asked for:
Thread Examples

Actually, multithreading means doing multiple pieces of work at the same time, so that processing can be completed in parallel.

Actually, multithreading means doing multiple pieces of work at the same time, so that processing can be completed in parallel. You can take a task off your main thread, execute it somewhere else, and be done.

ThreadPool.QueueUserWorkItem causing massive delay to UI thread due to lack of resources - better method to use? [duplicate]

Scenario
I have a Windows Forms Application. Inside the main form there is a loop that iterates around 3000 times, creating a new instance of a class on a new thread to perform some calculations. Bearing in mind that this setup uses a Thread Pool, the UI does stay responsive when there are only around 100 iterations of this loop (100 Assets to process). But as soon as this number begins to increase heavily, the UI locks up into eggtimer mode and thus the log that is writing out to the listbox on the form becomes unreadable.
Question
Am I right in thinking that the best way around this is to use a Background Worker?
And is the UI locking up because even though I'm using lots of different threads (for speed), the UI itself is not on its own separate thread?
Suggested Implementations greatly appreciated.
EDIT!!
So let's say that instead of just firing off and queuing up 3000 assets to process, I decide to do them in batches of 100. How would I go about doing this efficiently? I made an attempt earlier at adding "Thread.Sleep(5000);" after every batch of 100 was fired off, but the whole thing seemed to crap out....

If you are creating 3000 separate threads, you are pushing a documented limitation of the ThreadPool class:
"If an application is subject to bursts of activity in which large numbers of thread pool tasks are queued, use the SetMinThreads method to increase the minimum number of idle threads. Otherwise, the built-in delay in creating new idle threads could cause a bottleneck."
See that MSDN topic for suggestions to configure the thread pool for your situation.
If your work is CPU intensive, having that many separate threads will cause more overhead than it's worth. However, if it's very IO intensive, having a large number of threads may help things somewhat.
.NET 4 introduces outstanding support for parallel programming. If that is an option for you, I suggest you have a look at that.

More threads does not equal top speed. In fact too many threads equals less speed. If your task is simply CPU related you should only be using as many threads as you have cores otherwise you're wasting resources.
With 3,000 iterations and your form thread attempting to create a thread each time what's probably happening is you are maxing out the thread pool and the form is hanging because it needs to wait for a prior thread to complete before it can allocate a new one.
Apparently ThreadPool doesn't work this way. I have never checked it with threads before so I am not sure. Another possibility is that the tasks begin flooding the UI thread with invocations at which point it will give up on the GUI.

It's difficult to tell without seeing code - but, based on what you're describing, there is one suspect.
You mentioned that you have this running on the ThreadPool now. Switching to a BackgroundWorker won't change anything, dramatically, since it also uses the ThreadPool to execute. (BackgroundWorker just simplifies the invoke calls...)
That being said, I suspect the problem is your notifications back to the UI thread for your ListBox. If you're invoking too frequently, your UI may become unresponsive while it tries to "catch up". This can happen if you're feeding too much status info back to the UI thread via Control.Invoke.
Otherwise, make sure that ALL of your work is being done on the ThreadPool, and you're not blocking on the UI thread, and it should work.

If every thread logs something to your ui, every written log line must invoke the main thread. Better to cache the log-output and update the gui only every 100 iterations or something like that.
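One way to sketch that caching idea is shown below. `LogBatcher` is a hypothetical helper, and the batch size and the flush callback are assumptions. Worker threads call `Add`, and the callback only fires once per batch, so the UI thread receives one update per 100 lines instead of one per line.

```csharp
using System;
using System.Collections.Generic;

public sealed class LogBatcher
{
    private readonly object gate = new object();
    private readonly List<string> pending = new List<string>();
    private readonly int batchSize;
    private readonly Action<IList<string>> flush;

    // 'flush' receives a full batch; in a WinForms app it would typically
    // marshal to the UI thread (for example via Control.BeginInvoke).
    public LogBatcher(int batchSize, Action<IList<string>> flush)
    {
        this.batchSize = batchSize;
        this.flush = flush;
    }

    // Safe to call from any worker thread.
    public void Add(string line)
    {
        List<string> batch = null;
        lock (gate)
        {
            pending.Add(line);
            if (pending.Count >= batchSize)
            {
                batch = new List<string>(pending);
                pending.Clear();
            }
        }
        // Invoke the callback outside the lock so slow UI work
        // does not hold up other worker threads.
        if (batch != null) flush(batch);
    }

    // Pushes out whatever is left, for example when all work is done.
    public void Flush()
    {
        List<string> batch;
        lock (gate)
        {
            if (pending.Count == 0) return;
            batch = new List<string>(pending);
            pending.Clear();
        }
        flush(batch);
    }
}
```

With a ListBox, the callback could be something like `batch => listBox.BeginInvoke((Action)(() => { foreach (var l in batch) listBox.Items.Add(l); }))`, where `listBox` stands for whatever control the form actually uses.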

Since I haven't seen your code, this is just a lot of conjecture with some hopefully educated guessing.
All a threadpool does is queue up your requests and then fire new threads off as others complete their work. Now 3000 threads doesn't sound like a lot, but if there's a ton of processing going on you could be destroying your CPU.
I'm not convinced a background worker would help, since you would end up re-creating a manager to handle all the pooling the threadpool already gives you. I think your issue is more that you've got too much data chunking going on. A good place to start would be to throttle the number of threads you start and maintain. The threadpool manager easily allows you to do this. Find a balance that allows you to process data while still keeping the UI responsive.

Multi threading which would be the best to use? (Threadpool or threads)

Hopefully this is a better question than my previous. I have a .exe which I will be passing different parameters (file paths) to which it will then take in and parse. So I will have a loop going, looping through the file paths in a list and passing them to this .exe file.
For this to be more efficient, I want to spread the execution across multiple cores which I think you do through threading.
My question is, should I use the threadpool, or multiple threads to run this .exe asynchronously?
Also, depending on which one of those you guys think is the best, if you can point me to a tutorial that will have some info on what I want to do. Thank you!
EDIT:
I need to limit the number of executions of the .exe to ONE execution PER CORE. This is the most efficient because if I am parsing 100,000 files I can't just fire up 100000 processes. So I am using threads to limit the number of executions at one time to one execution per core. If there is another way (other than threads) to find out if a processor isn't tied up in execution, or if the .exe has finished please explain.
But if there isn't another way, my FINAL question is how would I use a thread to call a parse method and then call back when that thread is no longer in use?
SECOND UPDATE (VERY IMPORTANT):
I went through what everyone told me, and found out a key element that I left out that I thought didn't matter. So I am using a GUI and I don't want it to be locked up. THAT is why I wanted to use threads. My main question now is, how do I send back information from a thread so I know when the execution is over?

As I said in my answer to your previous question, I think you don't understand the difference between processes and threads. Processes are incredibly "heavy" (*); each process can contain many threads. If you are spawning new processes from a parent process, that parent process doesn't need to create new threads; each process will have its own collection of threads.
Only create threads in the parent process if all the work is being done in the same process.
Think of a thread as a worker, and a process as a building containing one or more workers.
One strategy is "build a single building and populate it with ten workers who each do some amount of work". You get the expense of building one process and ten threads.
If your strategy is "build a building. Then have the one worker in that building order the construction of a thousand more buildings, each of which contains a worker that does their bidding", then you get the expense of building 1001 buildings and hiring 1001 workers.
The strategy you do not want to pursue is "build a building. Hire 1000 workers in that building. Then instruct each worker to build a building, which then has one worker to go do the real work." There is no point in making a thread whose sole job is creating a process that then creates a thread! You have 1001 buildings and 2001 workers, half of whom are immediately idle but still have to be paid.
Looking at your specific problem: the key question is "where is the bottleneck?" Spawning off new processes or new threads only helps when the performance problem is that the perf is gated on the processor. If the performance of your parser is gated not on how fast you can parse the file but rather on how fast you can get it off disk, then parallelizing it is going to make things far, far worse. You'll have a huge amount of system resources devoted to all hammering on the same disk controller at the same time, and the disk controller will get slower as more load piles up on it.
UPDATE:
I need to limit the number of executions of the .exe to ONE execution PER CORE. This is the most efficient because if I am parsing 100,000 files I can't just fire up 100000 processes. So I am using threads to limit the number of executions at one time to one execution per core. If there is another way (other than threads) to find out if a processor isn't tied up in execution, or if the .exe has finished please explain
This seems like an awfully complicated way to go about it. Suppose you have n processors. Your proposed strategy, as I understand it, is to fire up n threads, then have each thread fire up one process, and you know that since the operating system will probably schedule one thread per CPU that somehow the processor will magically also schedule the new thread in each new process on a different CPU?
That seems like a tortuous chain of reasoning that depends on implementation details of the operating system. This is craziness. If you want to set the processor affinity of a particular process, just set the processor affinity on the process! Don't be doing this crazy thing with threads and hope that it works out.
I say that if you want to have no more than n instances of an executable running, one per processor, don't mess around with threads at all. Rather, just have one thread sit in a loop, constantly monitoring what processes are running. If there are fewer than n copies of the executable running, spawn another and set its processor affinity to be the CPU you like best. If there are n or more copies of the executable running, go to sleep for a second (or a minute, or whatever makes sense), and when you wake up, check again. Keep doing that until you're done. That seems like a much easier approach.
(*) Threads are also heavy, but they are lighter than processes.

Spontaneously I would push your file paths into a thread safe queue and then fire up a number of threads (say one per core). Each thread would repeatedly pop one item from the queue and process it accordingly. The work is done when the queue is empty.
Implementation suggestions (to answer some of the questions in comments):
Queue:
In C# you could have a look at the Queue Class and the Queue.Synchronized Method for the implementation of the queue:
"Public static (Shared in Visual Basic) members of this type are thread safe. Any instance members are not guaranteed to be thread safe.
To guarantee the thread safety of the Queue, all operations must be done through the wrapper returned by the Synchronized method.
Enumerating through a collection is intrinsically not a thread-safe procedure. Even when a collection is synchronized, other threads can still modify the collection, which causes the enumerator to throw an exception. To guarantee thread safety during enumeration, you can either lock the collection during the entire enumeration or catch the exceptions resulting from changes made by other threads."
Threading:
For the threading part I suppose that any of the examples in the msdn threading tutorial would do (the tutorial is a bit old, but should be valid). Should not need to worry about synchronizing the threads as they can work independently from each other. The queue above is the only common resource they should need to access (hence the importance of thread safety of the queue).
Start the external process (.exe):
The following code is borrowed (and tweaked) from How to wait for a shelled application to finish by using Visual C#. You need to edit for your own needs, but as a starter:
using System.Diagnostics;
// How to wait for a shelled process to finish.
// Create a new process info structure.
ProcessStartInfo pInfo = new ProcessStartInfo();
// Set the file name member of the process info structure.
// The verbatim string (@"...") keeps the backslash from being read as an escape sequence.
pInfo.FileName = @"mypath\myfile.exe";
// Start the process.
Process p = Process.Start(pInfo);
// Wait for the process to end.
p.WaitForExit();
Pseudo code:
Main thread:
    Create thread safe queue
    Populate the queue with all the file paths
    Create child threads and wait for them to finish
Child threads:
    While queue is not empty      << this section is critical: no more than one
        pop file from queue       << thread can check and pop at a time
        start external exe
        wait for it....
        end external exe
    end while
    Child thread exits
Main thread waits for all child threads to finish
Program finishes.
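The pseudo code above could be sketched in C# roughly as follows, assuming .NET 4.0 or later so that `ConcurrentQueue` can stand in for the synchronized `Queue` wrapper. `QueueWorkers` and its `work` delegate are hypothetical names. The delegate is where the external .exe would be started and waited on.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

public static class QueueWorkers
{
    // Runs 'work' once per path using 'workerCount' threads, and returns
    // when every path has been handled.
    public static void Run(IEnumerable<string> paths, int workerCount, Action<string> work)
    {
        // Populate the thread-safe queue with all the file paths.
        var queue = new ConcurrentQueue<string>(paths);
        var threads = new List<Thread>();

        for (int i = 0; i < workerCount; i++)
        {
            var thread = new Thread(() =>
            {
                string path;
                // TryDequeue checks and pops in one atomic step, which covers
                // the critical section marked in the pseudo code.
                while (queue.TryDequeue(out path))
                {
                    work(path);
                }
            });
            thread.Start();
            threads.Add(thread);
        }

        // Main thread waits for all child threads to finish.
        foreach (var t in threads)
        {
            t.Join();
        }
    }
}
```

A call might look like `QueueWorkers.Run(filePaths, Environment.ProcessorCount, path => { /* start the exe for this path and call WaitForExit */ });`. In a GUI application `Run` blocks until everything is finished, so it would itself have to be called from a background thread.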

See this question for how to find out the number of cores.
Then use Parallel.ForEach with ParallelOptions with MaxDegreeOfParallelism set to the number of cores.
Parallel.ForEach(args, new ParallelOptions() { MaxDegreeOfParallelism = Environment.ProcessorCount }, (element) => Console.WriteLine(element));

If you're targeting the .NET 4 framework, Parallel.For and Parallel.ForEach are extremely helpful. If those don't meet your requirements, I've found Task.Factory to be useful and straightforward to use as well.

To answer your revised question, you want processes. You just need to create the correct number of processes running the exe. Don't worry about forcing them onto specific cores. Windows will do that automatically.
How to do this:
You want to determine the number of cores on the machine. You may simply know it, and hardcode it, or you might want to use something like System.Environment.ProcessorCount.
Create a List<Process> object.
Then you want to start that many processes using System.Diagnostics.Process.Start. The return value will be a process object, which you will want to add to the List.
Now repeat the following until you are finished:
Call Thread.Sleep to wait for a while. Perhaps a minute or so.
Loop through each Process in the list but be sure to use a for loop rather than a foreach loop. For each process, call Refresh() then check the 'HasExited' property of each process, and if it is true, create a new process using Process.Start, and replace the exited process in the list with the newly created one.

If you're launching a .exe, then you have no choice. You will be running this asynchronously in a separate process. For the program which does the launching, I would recommend that you use a single thread and keep a list of the processes you launched.

Each exe launched will occur in its own process. You don't need to use a threadpool or multiple threads; the OS manages the processes (and since they're processes and not threads, they're very independent; completely separate memory space, etc.).

Which way is better for frequent file writes with a thread?

We need to write some data into a file at about 100ms intervals for about 30 minutes, then wait for a while, and repeat again. Our application is a C#, .NET 3.5 application. The data for each write is small, less than 1MB. Right now we get a thread from the ThreadPool and let that thread write into the file each time new data is received (at about 100ms intervals).
There is another way to do this, I think. We can get a thread from the ThreadPool at the beginning and keep that thread running during the entire writing session. When that thread finishes a write, it waits for the next signal, gets the updated data from a shared place and writes again. The downside of this approach is that we need to synchronize the shared data object to make sure it is not overwritten by new data if the writing is slower. That may in turn slow down the communication over which the data is transferred from another system.
I don't have time to write code to test them yet. Do you think it is worth testing them? Or is one way obviously better than the other?

You can use a producer/consumer pattern, so that a single thread can keep the file locked the whole time and you don't need to open and close the FileStream for every write.
Here an example: http://www.yoda.arachsys.com/csharp/threads/deadlocks.shtml
(in the "More Monitor methods" section)
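A minimal sketch of that producer/consumer idea follows. It assumes .NET 4.0 or later for `BlockingCollection` (the question targets .NET 3.5, where the Monitor-based queue from the linked article would be needed instead), and `QueuedFileWriter` is a hypothetical class name. Producers only enqueue; one long-running consumer owns the file and keeps it open for the whole session.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

public sealed class QueuedFileWriter : IDisposable
{
    private readonly BlockingCollection<string> queue = new BlockingCollection<string>();
    private readonly Task consumer;

    public QueuedFileWriter(string path)
    {
        // Single consumer: opens the file once and writes items as they arrive.
        consumer = Task.Factory.StartNew(() =>
        {
            using (var writer = new StreamWriter(path, true))
            {
                // Blocks while the queue is empty; ends after CompleteAdding
                // has been called and the queue has been drained.
                foreach (string line in queue.GetConsumingEnumerable())
                {
                    writer.WriteLine(line);
                }
            }
        }, TaskCreationOptions.LongRunning);
    }

    // Called by producers (for example every 100 ms); returns immediately.
    public void Enqueue(string line)
    {
        queue.Add(line);
    }

    // Stops accepting data, waits until everything is on disk, then cleans up.
    public void Dispose()
    {
        queue.CompleteAdding();
        consumer.Wait();
        queue.Dispose();
    }
}
```

Because each enqueued item is its own object, a slow write delays the queue rather than overwriting shared data, which addresses the synchronization worry in the question at the cost of some memory if the disk falls behind.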

If you think that you might have contention writing to your file, then a single writer thread is a good idea. Otherwise, there is no real problem dispatching a single operation to a miscellaneous thread, in my opinion.

Create new threads or get more work for threads

I've got a program I'm creating(in C#) and I see two approaches..
1) A job manager that waits for any number of X threads to finish, when finished it gets the next chunk of work and creates a new thread and gives it that chunk
or
2) We create X threads to start, give them each a chunk of work, and when a thread finishes a chunk it asks the job manager for more work. If there isn't any more work it sleeps and then asks again, with the sleep becoming progressively longer.
This program will be a run and done, though I could see it turning into a service that continually looks for more jobs.
Each chunk will consists of a number of data ids, a call to the database to get some info or perform an operation on the data id, and then writing to the database info on the data id.

Assuming you are aware of the additional precautions that need to be taken when dealing with multithreaded database operations, it sounds like you're describing two different scenarios. In the first, you have several threads running, and once ALL of them finish it will look for new work. In the second, you have several threads running and their operations are completely parallel. Your environment is going to determine the proper approach to take: if something ties the work of the several threads together so that additional work cannot continue until all of them are finished, then go with the former. If they don't have much effect on each other, go with the latter.

The second option isn't really right, as making the sleep time progressively longer means that you will unnecessarily keep those threads blocked.
Rather, you should have a pooled set of threads like the second option, but they use WaitHandles to wait for work and use a producer/consumer pattern. Basically, when the producer indicates that there is work, it sends a signal to a consumer (there will be a manager which will determine which thread will get the work, and then signal that thread) which will wake up and start working.
You might want to look into the Task Parallel Library. It's in beta now, but if you can use it and are comfortable with it, I would recommend it, as it will manage a great deal of this for you (and much better, taking into account the number of cores on a machine, the optimal number of threads, etc.).

The former solution (spawn a thread for each new piece of work), is easier to code, and not too bad, if the units of work are large enough.
The second solution (thread-pool, with a queue of work), is more complicated to code, but supports smaller units of work.

Instead of rolling your own solution, you should look at the ThreadPool class in the .NET framework. You could use the QueueUserWorkItem method. It should do exactly what you want to accomplish.
