Thread management console program with continuous (endless) tasks - c#

I'm new to threading so have some patience please.
I have tens of thousands of rows in a database. Each row represents a job that needs to be done over the internet. I read a data row, do some network-related work (which can take anywhere from a couple of seconds to a couple of minutes), and then grab the next data row (my C# application is a console application, not a GUI). As you might expect, I want to do these jobs concurrently.
I looked into this subject and I thought I would use BackgroundThreads, but if I understand correctly people suggest there is no point in using them in a console application.
I assume I should not use Tasks, because each of my "tasks" will be represented by a single thread.
So I thought I would use ThreadPool with regular Threads.
To make things simple I just want to keep a constant number of threads (spawning a new one when one finishes) until I run out of things to do (then I wait for data, usually a lot of it, to arrive in the database and spawn threads again). I need to know when a thread ends because I have to spawn a new thread and update the database row containing the data it was working with. To keep threads and database in sync I would probably have to mark the database row with some kind of thread id when it is retrieved and then mark the row (success/fail) when the thread ends.
Is this solution (try/catch in the thread delegate) enough to be sure that a thread has ended (and whether it succeeded or threw an exception)?
I am not sure how to "wait" for the first thread to end - not all and not a particular one.
I also think that I don't want to read too much data in advance (and potentially wait for a thread to free up) because there might be other programs doing the same thing using the same database.
Any ideas appreciated!

Just use Parallel.ForEach to do this:
Parallel.ForEach(rows, row => ProcessRow(row));
If you need to specify a max degree of parallelization because the automatic partitioner happens to be using too many thread pool threads then you can specify it like so:
Parallel.ForEach(rows, new ParallelOptions() { MaxDegreeOfParallelism = 5 }
, row => ProcessRow(row));
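To also record whether each row succeeded or failed (which the question asks about), the loop body can wrap the work in a try/catch. The following is only a minimal sketch: RowRunner, ProcessAll and the use of an int row id are made-up names for illustration, and processRow stands in for whatever network work you do per row.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class RowRunner
{
    // Runs processRow for every row id with at most maxParallel running at once.
    // Returns, per row id, null on success or the exception that was thrown.
    public static IDictionary<int, Exception> ProcessAll(
        IEnumerable<int> rowIds, Action<int> processRow, int maxParallel)
    {
        var outcomes = new ConcurrentDictionary<int, Exception>();
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallel };

        Parallel.ForEach(rowIds, options, id =>
        {
            try
            {
                processRow(id);
                outcomes[id] = null;   // success
            }
            catch (Exception ex)
            {
                outcomes[id] = ex;     // failure, kept so the caller can mark the row
            }
        });

        return outcomes;
    }
}
```

Parallel.ForEach returns only after every row has been handled, so the caller can walk the returned dictionary and update the success/fail column for each row.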

Related

Monitoring a multi-threaded application

I am writing a multi-thread application (using C#) where the job of each thread is to insert some data into database. As soon the thread completes its job of inserting data into database it becomes free (i.e. ready to insert another data into database). All the threads are reading data from a queue.
The problem is, how to monitor which thread has completed its current job and ready to take second job? Whether we can use C# task instead of thread and how?
Please note every thread is inserting data to the same database.
"The problem is, how to monitor which thread has completed its current job and ready to take second job?"
Why would you do that? The threads you create loop until there is no data left. If there is no thread (or fewer than wanted) and data arrives, start a new thread/task. Actually, use tasks. There is no need to monitor them. This would be ridiculously inefficient.
Whether we can use C# task instead of thread and how?
Yes, and it is as simple as "look up how to start a task, which google has an answer for". That said, your architecture likely needs adjustment: doing too many things in parallel will only waste memory, so limit the number of active threads/tasks to a specific number instead.
In my opinion you should use only Task not "Task in Thread".
Tasks are more flexible and already implemented robustly. In your case you can create a Task[] (with 10 Tasks if you want), and to know whether a task has completed its work you can check Task.IsCompleted, or read Task.Result if you have declared Task<TResult> objects (note that Result blocks until the task finishes).
In this way you have direct control of the processes during the asynchronous execution, including exception handling.
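A minimal sketch of that idea, assuming .NET 4.5 or later for Task.Run. TaskBatch and RunAll are made-up names, and each job is reduced to a Func<int> for illustration:

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

public static class TaskBatch
{
    // Starts one task per job and waits for all of them to finish.
    // Returns the tasks so the caller can inspect Status, Result and Exception.
    public static Task<int>[] RunAll(Func<int>[] jobs)
    {
        Task<int>[] tasks = jobs.Select(job => Task.Run(job)).ToArray();
        try
        {
            Task.WaitAll(tasks);
        }
        catch (AggregateException)
        {
            // At least one job threw; the details are on each task's Exception property.
        }
        return tasks;
    }
}
```

After RunAll returns, a task with IsFaulted set carries its error in Exception.InnerException, and reading Result on the others no longer blocks.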

How do I ensure that any of the threads are not waiting for something indefinitely?

I'm in process of writing a multi threaded application. Here's my case.
I grab a thousand records from the database, divide them into 5 chunks of list objects, and create 5 threads to process them. I do this same thing every minute as long as records remain in the database.
Task.Factory.StartNew(() => ProcessRecords(listRecords))
Inside ProcessRecords method, there is a small database update and some send mail takes place. (I'm using System.Net.Mail for email and don't use any ORM for db operation.)
Now I am worried that a thread might not complete because of some unknown issue. What will happen in that situation? Let's say one process (or even more) keeps on waiting because of a deadlock in the database or something; what will happen to my application? It will keep on adding new threads with new sets of records while some threads never end. How can I implement something like a timeout in this situation?
I want to run this process, terminate it in 5 minutes if it is not able to complete it.
Check out CancellationToken (created through a CancellationTokenSource). You can use it to ask the task to stop if you decide (by whatever means you prefer) that it's been running too long. Cancellation is cooperative: the task has to check the token itself.
Alternatively, you could build that into the ProcessRecords() method itself: just have it commit seppuku if it runs too long by having it track its own start time and checking the elapsed time now and then; could be simpler.
That said, if you haven't already given it a shot, you might check to see whether .AsParallel() will save you some headaches here. There are a lot of cases where you can leave your parallelization woes to the framework entirely.
Parallel.ForEach(db.Records, r => ProcessRecord(r));
Edit:
Parallel.ForEach(db.Records, ProcessRecord);
Yes. :)
Further edit:
For the OP, no, the TaskFactory doesn't offer anything like that out of the box. If you want to terminate the process from outside the process, you'll need to roll your own mechanism using some kind of a watcher thread to keep track of which tasks you have running, how long they've been running, and their respective cancellation tokens (or maybe just a bool you have at the top of a while loop...).
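A minimal sketch of the cancellation-token approach with a built-in timeout, assuming .NET 4.5 or later. TimedBatch is a made-up name, and the ProcessRecords signature here is hypothetical (records are reduced to ints and the per-record work is passed in as a delegate):

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class TimedBatch
{
    // Processes the records one by one, checking the token between records.
    // Returns true if the whole batch finished, false if the timeout hit first.
    public static bool ProcessRecords(
        IEnumerable<int> records, Action<int> processRecord, TimeSpan timeout)
    {
        // The source cancels its token by itself once the timeout elapses.
        using (var cts = new CancellationTokenSource(timeout))
        {
            CancellationToken token = cts.Token;
            Task work = Task.Run(() =>
            {
                foreach (int record in records)
                {
                    token.ThrowIfCancellationRequested();
                    processRecord(record);
                }
            }, token);

            try
            {
                work.Wait();
                return true;
            }
            catch (AggregateException ex) when (ex.InnerException is OperationCanceledException)
            {
                return false;
            }
        }
    }
}
```

The token is only checked between records, so a single call that hangs inside processRecord is not interrupted by this. For that case the database command and the SMTP client need timeouts of their own (for example SqlCommand.CommandTimeout and SmtpClient.Timeout).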

Multi threading which would be the best to use? (Threadpool or threads)

Hopefully this is a better question than my previous. I have a .exe which I will be passing different parameters (file paths) to which it will then take in and parse. So I will have a loop going, looping through the file paths in a list and passing them to this .exe file.
For this to be more efficient, I want to spread the execution across multiple cores which I think you do through threading.
My question is, should I use the threadpool, or multiple threads to run this .exe asynchronously?
Also, depending on which one of those you guys think is the best, if you can point me to a tutorial that will have some info on what I want to do. Thank you!
EDIT:
I need to limit the number of executions of the .exe to ONE execution PER CORE. This is the most efficient because if I am parsing 100,000 files I can't just fire up 100000 processes. So I am using threads to limit the number of executions at one time to one execution per core. If there is another way (other than threads) to find out if a processor isn't tied up in execution, or if the .exe has finished please explain.
But if there isn't another way, my FINAL question is how would I use a thread to call a parse method and then call back when that thread is no longer in use?
SECOND UPDATE (VERY IMPORTANT):
I went through what everyone told me, and found out a key element that I left out that I thought didn't matter. So I am using a GUI and I don't want it to be locked up. THAT is why I wanted to use threads. My main question now is, how do I send back information from a thread so I know when the execution is over?
As I said in my answer to your previous question, I think you don't understand the difference between processes and threads. Processes are incredibly "heavy" (*); each process can contain many threads. If you are spawning new processes from a parent process, that parent process doesn't need to create new threads; each process will have its own collection of threads.
Only create threads in the parent process if all the work is being done in the same process.
Think of a thread as a worker, and a process as a building containing one or more workers.
One strategy is "build a single building and populate it with ten workers who each do some amount of work". You get the expense of building one process and ten threads.
If your strategy is "build a building. Then have the one worker in that building order the construction of a thousand more buildings, each of which contains a worker that does their bidding", then you get the expense of building 1001 buildings and hiring 1001 workers.
The strategy you do not want to pursue is "build a building. Hire 1000 workers in that building. Then instruct each worker to build a building, which then has one worker to go do the real work." There is no point in making a thread whose sole job is creating a process that then creates a thread! You have 1001 buildings and 2001 workers, half of whom are immediately idle but still have to be paid.
Looking at your specific problem: the key question is "where is the bottleneck?" Spawning off new processes or new threads only helps when the performance problem is that the perf is gated on the processor. If the performance of your parser is gated not on how fast you can parse the file but rather on how fast you can get it off disk, then parallelizing it is going to make things far, far worse. You'll have a huge amount of system resources devoted to all hammering on the same disk controller at the same time, and the disk controller will get slower as more load piles up on it.
UPDATE:
I need to limit the number of executions of the .exe to ONE execution PER CORE. This is the most efficient because if I am parsing 100,000 files I can't just fire up 100000 processes. So I am using threads to limit the number of executions at one time to one execution per core. If there is another way (other than threads) to find out if a processor isn't tied up in execution, or if the .exe has finished please explain
This seems like an awfully complicated way to go about it. Suppose you have n processors. Your proposed strategy, as I understand it, is to fire up n threads, then have each thread fire up one process, and you know that since the operating system will probably schedule one thread per CPU that somehow the processor will magically also schedule the new thread in each new process on a different CPU?
That seems like a tortuous chain of reasoning that depends on implementation details of the operating system. This is craziness. If you want to set the processor affinity of a particular process, just set the processor affinity on the process! Don't be doing this crazy thing with threads and hope that it works out.
I say that if you want to have no more than n instances of an executable running, one per processor, don't mess around with threads at all. Rather, just have one thread sit in a loop, constantly monitoring what processes are running. If there are fewer than n copies of the executable running, spawn another and set its processor affinity to be the CPU you like best. If there are n or more copies of the executable running, go to sleep for a second (or a minute, or whatever makes sense), and when you wake up, check again. Keep doing that until you're done. That seems like a much easier approach.
(*) Threads are also heavy, but they are lighter than processes.
Spontaneously I would push your file paths into a thread safe queue and then fire up a number of threads (say one per core). Each thread would repeatedly pop one item from the queue and process it accordingly. The work is done when the queue is empty.
Implementation suggestions (to answer some of the questions in comments):
Queue:
In C# you could have a look at the Queue Class and the Queue.Synchronized Method for the implementation of the queue:
"Public static (Shared in Visual Basic) members of this type are thread safe. Any instance members are not guaranteed to be thread safe.
To guarantee the thread safety of the Queue, all operations must be done through the wrapper returned by the Synchronized method.
Enumerating through a collection is intrinsically not a thread-safe procedure. Even when a collection is synchronized, other threads can still modify the collection, which causes the enumerator to throw an exception. To guarantee thread safety during enumeration, you can either lock the collection during the entire enumeration or catch the exceptions resulting from changes made by other threads."
Threading:
For the threading part I suppose that any of the examples in the msdn threading tutorial would do (the tutorial is a bit old, but should be valid). You should not need to worry about synchronizing the threads as they can work independently from each other. The queue above is the only common resource they should need to access (hence the importance of thread safety of the queue).
Start the external process (.exe):
The following code is borrowed (and tweaked) from How to wait for a shelled application to finish by using Visual C#. You need to edit for your own needs, but as a starter:
//How to Wait for a Shelled Process to Finish
//Requires: using System.Diagnostics;
//Create a new process info structure.
ProcessStartInfo pInfo = new ProcessStartInfo();
//Set the file name member of the process info structure.
//(A verbatim string keeps the backslash from being read as an escape sequence.)
pInfo.FileName = @"mypath\myfile.exe";
//Start the process.
Process p = Process.Start(pInfo);
//Wait for the process to end.
p.WaitForExit();
Pseudo code:
Main thread:
    Create thread safe queue
    Populate the queue with all the file paths
    Create child threads and wait for them to finish
Child threads:
    While queue is not empty    << this section is critical: not more than one
        pop file from queue     << thread can check and pop at a time
        start external exe
        wait for it....
        end external exe
    end while
    Child thread exits
Main thread waits for all child threads to finish
Program finishes.
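The pseudo code above could be sketched in C# roughly as follows. This swaps in ConcurrentQueue<T> (available from .NET 4) instead of Queue.Synchronized, because its TryDequeue does the "check and pop" in one atomic step. QueueWorkers, Run and handlePath are made-up names; handlePath stands in for starting the external exe and waiting for it.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

public static class QueueWorkers
{
    // Drains the queue with workerCount threads; each thread pops one path at a time.
    public static void Run(IEnumerable<string> paths, int workerCount, Action<string> handlePath)
    {
        var queue = new ConcurrentQueue<string>(paths);
        var workers = new List<Thread>();

        for (int i = 0; i < workerCount; i++)
        {
            var worker = new Thread(() =>
            {
                string path;
                // TryDequeue checks and pops atomically, so no extra lock is needed.
                while (queue.TryDequeue(out path))
                {
                    handlePath(path);
                }
            });
            worker.Start();
            workers.Add(worker);
        }

        // Main thread waits for all child threads to finish.
        foreach (Thread t in workers)
        {
            t.Join();
        }
    }
}
```

A call such as QueueWorkers.Run(filePaths, Environment.ProcessorCount, path => { /* Process.Start, then WaitForExit */ }) gives one worker per core. An exception escaping handlePath would be unhandled on that worker thread, so real code should catch inside the delegate.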
See this question for how to find out the number of cores.
Then use Parallel.ForEach with ParallelOptions with MaxDegreeOfParallelism set to the number of cores.
Parallel.ForEach(args, new ParallelOptions() { MaxDegreeOfParallelism = Environment.ProcessorCount }, (element) => Console.WriteLine(element));
If you're targeting .NET Framework 4, Parallel.For and Parallel.ForEach are extremely helpful. If those don't meet your requirements, I've found Task.Factory to be useful and straightforward to use as well.
To answer your revised question, you want processes. You just need to create the correct number of processes running the exe. Don't worry about forcing them onto specific cores. Windows will do that automatically.
How to do this:
You want to determine the number of cores on the machine. You may simply know it, and hardcode it, or you might want to use something like System.Environment.ProcessorCount.
Create a List<Process> object.
Then you want to start that many processes using System.Diagnostics.Process.Start. The return value will be a process object, which you will want to add to the List.
Now repeat the following until you are finished:
Call Thread.Sleep to wait for a while. Perhaps a minute or so.
Loop through each Process in the list but be sure to use a for loop rather than a foreach loop. For each process, call Refresh() then check the 'HasExited' property of each process, and if it is true, create a new process using Process.Start, and replace the exited process in the list with the newly created one.
If you're launching a .exe, then you have no choice. You will be running this asynchronously in a separate process. For the program which does the launching, I would recommend that you use a single thread and keep a list of the processes you launched.
Each exe launched will occur in its own process. You don't need to use a threadpool or multiple threads; the OS manages the processes (and since they're processes and not threads, they're very independent; completely separate memory space, etc.).

Clearing the ThreadPool

I'm working on my first ThreadPool application in Visual Studio 2008 with C#.
I have a report that has to perform calculations on 2000 to 4000 parts using data on our SQL Server.
I am Queuing all of the part numbers in a ThreadPool, where they go off and calculate their results. When these threads are finished, the RegisterWaitForSingleObject event fires to Unregister the Queued Item's Handle.
After all of the Queued Items have finished, is there a way to remove them from the ThreadPool?
The way it looks, if someone runs another report using a new set of 2000 to 4000 parts, I have no way of removing the previous array of parts.
How would I remove the Previously Queued Items? Would calling SetMaxThreads with workerThreads = 0 do it?
I realize I could experiment, but then I could waste most of the week experimenting.
Thanks for your time,
Joe
Once a ThreadPool item completes, it is automatically removed from the queue. What is indicating to you that they aren't?
Assuming you mean to interrupt (cancel) the work on the current queue...
Changing the max-threads won't affect the pending work; it'll just change the number of threads available to do it - and it is generally a bad idea to mess with this (your code isn't the only thing using the ThreadPool). I would use a custom queue - it is fairly easy to write a basic (thread-safe) producer/consumer queue, or .NET 4.0 includes some very good custom thread queues.
Then you can just abort the custom queue and start a new one.
I wrote a simple one here; at the moment it wants to exit cleanly (i.e. drain the queue until it is empty) before terminating, but it would be easy enough to add a flag to stop immediately after the current item (don't resort to interrupting/aborting threads at an arbitrary point in execution; never a good idea).
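As an illustration only (this is not the queue linked above), a custom queue with both a "drain, then stop" and a "stop after the current item" mode could be sketched on top of BlockingCollection<T> from .NET 4. WorkQueue and its member names are made up for this example.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public sealed class WorkQueue : IDisposable
{
    private readonly BlockingCollection<Action> _items = new BlockingCollection<Action>();
    private readonly CancellationTokenSource _stop = new CancellationTokenSource();
    private readonly Task _consumer;

    public WorkQueue()
    {
        // One dedicated consumer; LongRunning hints that it should not borrow a pool thread.
        _consumer = Task.Factory.StartNew(Consume, TaskCreationOptions.LongRunning);
    }

    public void Enqueue(Action item)
    {
        _items.Add(item);
    }

    // Stops after the item that is currently running; anything still queued is dropped.
    public void StopNow()
    {
        _stop.Cancel();
        _consumer.Wait();
    }

    // Lets the queue drain until it is empty, then stops.
    public void CompleteAndWait()
    {
        _items.CompleteAdding();
        _consumer.Wait();
    }

    private void Consume()
    {
        try
        {
            foreach (Action item in _items.GetConsumingEnumerable(_stop.Token))
            {
                item();
            }
        }
        catch (OperationCanceledException)
        {
            // StopNow was called: leave the remaining items unprocessed.
        }
    }

    public void Dispose()
    {
        _stop.Dispose();
        _items.Dispose();
    }
}
```

When a second report starts, the previous WorkQueue can be stopped with StopNow and a fresh one created, which avoids touching the shared ThreadPool settings. No running item is ever interrupted midway.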

Create new threads or get more work for threads

I've got a program I'm creating (in C#) and I see two approaches:
1) A job manager that waits for any number of X threads to finish, when finished it gets the next chunk of work and creates a new thread and gives it that chunk
or
2) We create X threads to start, give them each a chunk of work, and when a thread finishes a chunk it asks the job manager for more work. If there isn't any more work it sleeps and then asks again, with the sleep becoming progressively longer.
This program will be run-and-done, though I could see it turning into a service that continually looks for more jobs.
Each chunk will consist of a number of data ids, a call to the database to get some info or perform an operation on the data id, and then writing info about the data id back to the database.
Assuming you are aware of the additional precautions that need to be taken when dealing with multithreaded database operations, it sounds like you're describing two different scenarios. In the first, you have several threads running, and once ALL of them finish it will look for new work. In the second, you have several threads running and their operations are completely parallel. Your environment is going to be what determines the proper approach to take; if there is something tying the work of the several threads together so that additional work cannot continue until all of them are finished, then go with the former. If they don't have much effect on each other, go with the latter.
The second option isn't really right, as making the sleep time progressively longer means that you will unnecessarily keep those threads blocked.
Rather, you should have a pooled set of threads like the second option, but they use WaitHandles to wait for work and use a producer/consumer pattern. Basically, when the producer indicates that there is work, it sends a signal to a consumer (there will be a manager which will determine which thread will get the work, and then signal that thread) which will wake up and start working.
You might want to look into the Task Parallel Library. It's in beta now, but if you can use it and are comfortable with it, I would recommend it, as it will manage a great deal of this for you (and much better, taking into account the number of cores on a machine, the optimal number of threads, etc, etc).
The former solution (spawn a thread for each new piece of work), is easier to code, and not too bad, if the units of work are large enough.
The second solution (thread-pool, with a queue of work), is more complicated to code, but supports smaller units of work.
Instead of rolling your own solution, you should look at the ThreadPool class in the .NET framework. You could use the QueueUserWorkItem method. It should do exactly what you want to accomplish.
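A minimal sketch of the QueueUserWorkItem route, using a CountdownEvent (.NET 4) to learn when every chunk is done. PoolBatch and RunAll are made-up names, and a chunk is reduced to an int[] of data ids for illustration.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

public static class PoolBatch
{
    // Queues one work item per chunk on the shared ThreadPool and blocks until all have run.
    // Returns the exceptions thrown by chunks (empty if everything succeeded).
    public static Exception[] RunAll(IList<int[]> chunks, Action<int[]> processChunk)
    {
        var errors = new ConcurrentQueue<Exception>();
        if (chunks.Count == 0) return errors.ToArray();

        using (var pending = new CountdownEvent(chunks.Count))
        {
            foreach (int[] chunk in chunks)
            {
                int[] current = chunk;   // private copy for the closure
                ThreadPool.QueueUserWorkItem(_ =>
                {
                    // An exception escaping a pool thread would take the process down,
                    // so it is caught and collected here instead.
                    try { processChunk(current); }
                    catch (Exception ex) { errors.Enqueue(ex); }
                    finally { pending.Signal(); }
                });
            }
            pending.Wait();
        }
        return errors.ToArray();
    }
}
```

The pool decides how many chunks actually run at once, so this fits the "X threads, keep them fed" approach without any hand-written sleeping or polling.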
