Multiple Threads - c#

I post a lot here regarding multithreading, and the great Stack Overflow community has helped me a lot in understanding multithreading.
All the examples I have seen online only deal with one thread.
My application is a scraper for an insurance company (family company ... all free of charge). Anyway, the user is able to select how many threads they want to run. So let's say, for example, the user wants the application to scrape 5 sites at one time, and then later in the day he chooses 20 threads because his computer isn't doing anything else, so it has the resources to spare.
Basically the application builds a list of say 1000 sites to scrape. A thread goes off and does that and updates the UI and builds the list.
When that's finished, another thread is called to start the scraping. Depending on the number of threads the user has chosen, it will create x number of threads.
What's the best way to create these threads? Should I create 1000 threads in a list and loop through them? If the user has set 5 threads to run, it will loop through 5 at a time.
I understand threading, but it's the application logic which is catching me out.
Any ideas or resources on the web that can help me out?

You could consider using a thread pool for that:
using System;
using System.Threading;
public class Example
{
public static void Main()
{
ThreadPool.SetMaxThreads(100, 10);
// Queue the task.
ThreadPool.QueueUserWorkItem(new WaitCallback(ThreadProc));
Console.WriteLine("Main thread does some work, then sleeps.");
Thread.Sleep(1000);
Console.WriteLine("Main thread exits.");
}
// This thread procedure performs the task.
static void ThreadProc(Object stateInfo)
{
Console.WriteLine("Hello from the thread pool.");
}
}

This scraper, does it use a lot of CPU when it's running?
If it does a lot of communication with these 1000 remote sites, downloading their pages, that may be taking more time than the actual analysis of the pages.
And how many CPU cores does your user have? If they have 2 (which is common these days) then beyond two simultaneous threads performing analysis, they aren't going to see any speed up.
So you probably need to "parallelize" the downloading of the pages. I doubt you need to do the same for the analysis of the pages.
Take a look into asynchronous IO, instead of explicit multi-threading. It lets you launch a bunch of downloads in parallel and then get called back when each one completes.
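As a rough illustration of that idea, here is a minimal sketch using Python's asyncio rather than C# (the pattern is the point, not the API). The fetch coroutine is a hypothetical stand-in that sleeps instead of downloading, and the names scrape_all and max_parallel are assumptions made up for this example. A semaphore caps how many downloads are in flight, which plays the role of the user's thread-count setting.

```python
import asyncio


async def fetch(url, limiter, stats):
    # Hypothetical stand-in for a real download: sleeps instead of doing network I/O.
    async with limiter:  # at most max_parallel coroutines get past this point
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)
        stats["active"] -= 1
    return url, len(url)  # placeholder "result" of analysing the page


async def scrape_all(urls, max_parallel):
    limiter = asyncio.Semaphore(max_parallel)
    stats = {"active": 0, "peak": 0}
    # gather() starts every download and returns results in the order of urls.
    results = await asyncio.gather(*(fetch(u, limiter, stats) for u in urls))
    return results, stats["peak"]


if __name__ == "__main__":
    sites = ["http://example.com/page%d" % i for i in range(20)]
    pages, peak = asyncio.run(scrape_all(sites, max_parallel=5))
    print(len(pages), "pages fetched, peak concurrency", peak)
```

Everything runs on one thread here; the concurrency comes from waiting on many downloads at once, which is why the core count matters much less than it does for the analysis step.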

If you really just want the application, use something someone else already spent time developing and perfecting:
http://arachnode.net/
arachnode.net is a complete and comprehensive .NET web crawler for downloading, indexing and storing Internet content including e-mail addresses, files, hyperlinks, images, and Web pages.
Whether interested or involved in screen scraping, data mining, text mining, research or any other application where a high-performance crawling application is key to the success of your endeavors, arachnode.net provides the solution you need for success.
If you also want to write one yourself because it's a fun thing to write (I wrote one not long ago, and yes, it is a lot of fun), then you can refer to this PDF provided by arachnode.net, which explains in detail the theory behind a good web crawler:
http://arachnode.net/media/Default.aspx?Sort=Downloads&PageIndex=1
Download the pdf entitled: "Crawling the Web" (second link from top). Scroll to Section 2.6 entitled: "2.6 Multi-threaded Crawlers". That's what I used to build my crawler, and I must say, I think it works quite well.

I think this example is basically what you need.
public class WebScraper
{
private readonly int totalThreads;
private readonly List<System.Threading.Thread> threads;
private readonly List<Exception> exceptions;
private readonly object locker = new object();
private volatile bool stop;
public WebScraper(int totalThreads)
{
this.totalThreads = totalThreads;
threads = new List<System.Threading.Thread>(totalThreads);
exceptions = new List<Exception>();
for (int i = 0; i < totalThreads; i++)
{
var thread = new System.Threading.Thread(Execute);
thread.IsBackground = true;
threads.Add(thread);
}
}
public void Start()
{
foreach (var thread in threads)
{
thread.Start();
}
}
public void Stop()
{
stop = true;
foreach (var thread in threads)
{
if (thread.IsAlive)
{
thread.Join();
}
}
}
private void Execute()
{
try
{
while (!stop)
{
// Scrape away!
}
}
catch (Exception ex)
{
lock (locker)
{
// You could have a thread checking this collection and
// reporting it as you see fit.
exceptions.Add(ex);
}
}
}
}

The basic logic is:
You have a single queue into which you put the URLs to scrape. Then you create your threads and give every thread access to that queue object. Each thread runs a loop:
1. lock the queue
2. check if there are items in the queue; if not, unlock the queue and end the thread
3. dequeue the first item in the queue
4. unlock the queue
5. process the item
6. invoke an event that updates the UI (remember to lock the UI controller)
7. return to step 1
Just let the Threads do the "get stuff from the queue" part (pulling the jobs) instead of giving them the urls (pushing the jobs), that way you just say
YourThreadManager.StartThreads(numberOfThreadsTheUserWants);
and everything else happens automagically. See the other replies to find out how to create and manage the threads.
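The pull model described above can be sketched roughly as follows. This is Python rather than C#, and process, worker and start_threads are hypothetical names invented for the example. Python's queue.Queue does the lock/check/dequeue/unlock steps internally, so the loop body is shorter than the step list suggests.

```python
import queue
import threading


def process(url):
    # Hypothetical placeholder for the real scraping work.
    return url.upper()


def worker(jobs, results, results_lock):
    while True:
        try:
            url = jobs.get_nowait()  # lock, check, dequeue and unlock in one call
        except queue.Empty:
            return                   # nothing left: end this thread
        outcome = process(url)
        with results_lock:           # stands in for the locked UI update
            results.append(outcome)


def start_threads(urls, thread_count):
    jobs = queue.Queue()
    for url in urls:
        jobs.put(url)
    results, results_lock = [], threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(jobs, results, results_lock))
        for _ in range(thread_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Calling start_threads(urls, 5) or start_threads(urls, 20) is the only place the user's setting appears; the list of 1000 sites never turns into 1000 threads.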

I solved a similar problem by creating a worker class that uses a callback to signal the main app that a worker is done. Then I create a queue of 1000 threads and then call a method that launches threads until the running thread limit is reached, keeping track of the active threads with a dictionary keyed by the thread's ManagedThreadId. As each thread completes, the callback removes its thread from the dictionary and calls the thread launcher.
If a connection is dropped or times out, the callback reinserts the thread back into the queue. Lock around the queue and the dictionary. I create threads vs using the thread pool because the overhead of creating a thread is insignificant compared to the connection time, and it allows me to have a lot more threads in flight. The callback also provides a convenient place from which to update the user interface, even allowing you to change the thread limit while it's running. I've had over 50 open connections at one time. Remember to increase the maxconnection setting in your app.config (the default is two).

I would use a queue and a condition variable and mutex, and start just the requested number of threads, for example, 5 or 20 (and not start 1,000).
Each thread blocks on the condition variable. When woken up, it dequeues the first item, unlocks the queue, works with the item, locks the queue and checks for more items. If the queue is empty, sleep on the condition variable. If not, unlock, work, repeat.
While the mutex is locked, it can also check if the user has requested the count of threads to be reduced. Just check if count > max_count, and if so, the thread terminates itself.
Any time you have more sites to queue, just lock the mutex and add them to the queue, then broadcast on the condition variable. Any threads that are not already working will wake up and take new work.
Any time the user increases the requested thread count, just start them up and they will lock the queue, check for work, and either sleep on the condition variable or get going.
Each thread will be continually pulling more work from the queue, or sleeping. You don't need more than 5 or 20.
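A minimal sketch of this design, written in Python for brevity, might look like the following. WorkQueue and its method names are assumptions made for the example; threading.Condition bundles the mutex and the condition variable, and the close method (drain the queue, then exit) is an addition that the description above does not mention.

```python
import threading
from collections import deque


class WorkQueue:
    def __init__(self, handler, thread_count):
        self.handler = handler              # hypothetical callable that processes one item
        self.items = deque()
        self.cond = threading.Condition()   # a mutex and a condition variable in one object
        self.max_count = 0
        self.count = 0                      # threads currently alive
        self.closed = False
        self.threads = []
        self.set_thread_count(thread_count)

    def add(self, new_items):
        with self.cond:
            self.items.extend(new_items)
            self.cond.notify_all()          # broadcast: idle threads wake up and take work

    def set_thread_count(self, wanted):
        with self.cond:
            self.max_count = wanted
            while self.count < wanted:      # user raised the limit: start more threads
                thread = threading.Thread(target=self._run)
                self.threads.append(thread)
                self.count += 1
                thread.start()
            self.cond.notify_all()          # user lowered it: let surplus threads notice

    def close(self):
        # Assumption for this sketch: closing means "finish the queue, then exit".
        with self.cond:
            self.closed = True
            self.cond.notify_all()
        for thread in self.threads:
            thread.join()

    def _run(self):
        while True:
            with self.cond:
                while not self.items and not self.closed and self.count <= self.max_count:
                    self.cond.wait()        # sleep on the condition variable
                if self.count > self.max_count or not self.items:
                    self.count -= 1         # surplus thread, or closed and drained
                    return
                item = self.items.popleft()
            self.handler(item)              # the work happens outside the lock
```

The count > max_count check runs while the lock is held, so surplus threads retire one at a time and the count never drops below the requested limit.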

Consider using the event-based asynchronous pattern (AsyncOperation and AsyncOperationManager Classes)

You might want to take a look at the ProcessQueue article on CodeProject.
Essentially, you'll want to create (and start) the appropriate number of threads; in your case that number comes from the user. Each of these threads should process a site, then find the next site that needs processing. Even if you don't use the object itself (it sounds like it would suit your purposes pretty well, though I'm obviously biased!), it should give you some good insight into how this sort of thing would be done.


Python timeout lock with option to stop

I'm using Python 3.6 to synchronize multiple threads. I have a "master thread" that hands out work to all the other threads. When a worker thread finishes its work, it signals the master thread to give it more work.
In order to achieve that, the master thread waits for one (or more) threads to finish before collecting new data to process.
while True:
    while freeWorkers > 0:
        # Give the worker more work...
    time.sleep(5)  # wait for 5 seconds before checking if we got free workers.
Basically, it works. I want to improve it as follows: after a worker finishes its job, it reports somehow to the "master" thread. Because the master thread is really quick, in most cases it will be sleeping. I want to make it stop sleeping, which will trigger giving more work to the free workers.
In C#, I did this trick as follows:
An object to handle the synchronization:
public object SyncingClock { get; private set; } = new object();
Entering sleep like this:
lock (SyncingClock)
Monitor.Wait(SyncingClock, 5000);
A worker thread reports completion like this:
lock (SyncingClock)
Monitor.Pulse(SyncingClock);
So I'm looking for a way to perform this C# trick in Python (or any alternative).
Thanks.
I think you should look at event-driven programming (https://emptypage.jp/notes/pyevent.en.html)
instead of having a while loop polling for finished workers.
For example, something like this:
def create_thread(self, work_finished_method):
    t = some_method_to_create_and_prepare_a_thread()
    t.event_finished += work_finished_method
    return t

class MyThread:
    name = "SomeNameForTheThread"
    event_finished = event.Event(name + " has finished.")

    def finished(self):
        self.event_finished()

    def do_work(self):
        do_something()
        self.finished()
and when the work_finished method is called in the main thread you can assign new work to the thread.
This is done with a Condition object.
self.condition = threading.Condition()
For waiting until timeout or pulse, do:
with service.condition:
    service.condition.wait(5)
For notify:
with self.condition:
    self.condition.notify_all()
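Put together, the Condition approach could look something like the sketch below. The Master class and its method names are hypothetical. One deliberate difference from the snippet above: it uses wait_for with a counter instead of a bare wait, so a notification that arrives before the master starts waiting is not lost.

```python
import threading


class Master:
    def __init__(self):
        self.condition = threading.Condition()
        self.free_workers = 0

    def worker_done(self):
        # Called from a worker thread; plays the role of Monitor.Pulse.
        with self.condition:
            self.free_workers += 1
            self.condition.notify()

    def wait_for_free_worker(self, timeout=5.0):
        # Plays the role of Monitor.Wait(obj, 5000): returns early when notified,
        # or after the timeout if no worker reported in.
        with self.condition:
            self.condition.wait_for(lambda: self.free_workers > 0, timeout)
            free, self.free_workers = self.free_workers, 0
            return free
```

The master loop then becomes "call wait_for_free_worker, hand out that many jobs, repeat", with the 5-second timeout kept only as a safety net.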

Threading synchronization issue, 3 threads running concurrently, the 4th must run while others are waiting

Sorry for the title, I couldn't find better to explain my issue...
I'm having a hard time trying to synchronize different threads in my application. It's probably an easy problem for someone that has a new look on the issue, but after hours of investigations about a deadlock, my head is exploding and I can't find a good and safe way to write my synchronization mechanism :(
Basically, I have a .Net process that runs in multiple threads (everything in a single process, so no need for IPC). I have 4 threads:
1 thread, say it is called SpecificThread. There is a System.Timers.Timer that periodically executes some code.
3 other threads, each running a service that executes some code periodically (while (true) loop + Thread.Sleep(few ms)).
All 3 services must run concurrently. I guarantee their concurrent execution is thread safe.
The fourth thread, SpecificThread, must execute its code periodically, but it must block the execution of the 3 other services.
So basically I have SpecificThread that executes code periodically. When SpecificThread wants to execute its code periodically, it must wait for other services to complete their task. When all other 3 services completed their task, it must execute its SpecificCode while other 3 services are blocked. When its SpecificCode is executed, other 3 services can run their code again.
I have a shared instance of a SynchronizationContext object that is shared between all 4 threads. I can use it to synchronize my threads:
public class SynchronizationContext
{
public void StartService1()
{
...
}
public void StopService1()
{
...
}
...
public void StartSpecificCode()
{
// Some sync here that wait until all 3 services completed their
// respective tasks
}
public void NotifySpecificCodeCompleted()
{
// Some sync here that allows services 1 to 3 to execute again
}
}
The 3 services execution mechanism looks like:
// Only exits the loop when stopping the whole .Net process
while (myService.IsRunning)
{
try
{
this.synchronizationContext.StartService1();
// Do some job
}
finally
{
this.synchronizationContext.StopService1();
// Avoids too much CPU usage for nothing in the loop
Thread.Sleep(50);
}
}
The SpecificThread execution mechanism:
// System.Timers.Timer that is instantiated on process start
if (this.timer != null)
{
this.timer.Stop();
}
try
{
// Must block until computation is possible
this.synchronizationContext.StartSpecificCode();
// Some job here that must execute while other 3
// services are waiting
}
finally
{
// Notify computation is done
this.synchronizationContext.NotifySpecificCodeCompleted();
// Starts timer again
if (this.timer != null)
{
this.timer.Start();
}
}
I can't figure out how to use critical sections, as only SpecificThread must run while the others are waiting. I didn't find a way with Semaphore or AutoResetEvent (their usage introduced a hard-to-debug deadlock in my code). I'm running out of ideas here... Maybe Interlocked static methods would help?
Last word: my code must run with .Net 3.5, I can't use any TPL nor CountdownEvent classes...
Any help is appreciated!
ReaderWriterLockSlim sounds like exactly the tool that will help you the most. Have each of the services take out a read lock inside the body of their loop:
while (true)
{
// Acquire before the try block so ExitReadLock only runs if the lock is held
lockObject.EnterReadLock();
try
{
//Do stuff
}
finally
{
lockObject.ExitReadLock();
}
}
Then your fourth thread can enter a write lock when it wants to do its work. The way reader/writer locks work is that any number of readers can hold the lock as long as no writer holds it, and only one writer can hold the lock at a time. This means that none of the three workers will block the other workers, but the workers will block while the fourth thread is running, which is exactly what you want.
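To make the reader/writer rule concrete, here is a small sketch of such a lock in Python, which has no reader/writer lock in its standard library. It illustrates the rule only and is not how ReaderWriterLockSlim is implemented; the ReadWriteLock class is hypothetical, and letting a waiting writer hold back new readers is an assumption chosen so that the fourth thread cannot be starved by the three services.

```python
import threading


class ReadWriteLock:
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0           # services currently inside their read section
        self._writer = False        # True while the exclusive thread is running
        self._writers_waiting = 0

    def acquire_read(self):
        with self._cond:
            # New readers queue up behind an active or waiting writer.
            while self._writer or self._writers_waiting:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()   # a waiting writer may proceed now

    def acquire_write(self):
        with self._cond:
            self._writers_waiting += 1
            while self._writer or self._readers:
                self._cond.wait()         # wait for running services to finish
            self._writers_waiting -= 1
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()       # let the services run again
```

In terms of the question, StartService1 and StopService1 map to acquire_read and release_read, while StartSpecificCode and NotifySpecificCodeCompleted map to acquire_write and release_write.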

Section of code with up to N threads executing in FIFO order

I have a section of code which should be executed by a maximum number of threads lower than N and also the order in which threads are calling someFunction() should be reflected in the order in which they are entering the section, that is FIFO order.
If I use the Semaphore I have no control over the order in which threads are entering the section.
"There is no guaranteed order, such as FIFO or LIFO, in which blocked
threads enter the semaphore."
The initial attempt:
class someClass
{
static volatile Semaphore semaphore;
...
someClass()
{
semaphore = new Semaphore(N, N);
}
someType someFunction(InputType input)
{
try
{
semaphore.WaitOne();
/* Section Begins */
var response = someHeavyJob(input); // submitted to the server
return response;
/* Section Ends */
}
finally
{
semaphore.Release();
}
}
}
If I combine a Semaphore and a ConcurrentQueue as follows, a thread may come back with the response to a request brought by another thread, which would require significant changes in other parts of the code.
What is the .NET 4.5 solution for the following problem:
Allow for maximum number of threads lower than N in the section of code
The order in which threads are entering the section is FIFO
Threads will get the response for the request they brought (and not the response to the request brought by other threads)
class someClass
{
static volatile ConcurrentQueue<Request> cqueue;
static volatile Semaphore semaphore;
...
someClass()
{
cqueue = new ConcurrentQueue<Request>();
semaphore = new Semaphore(N, N);
}
someType someFunction(Request request)
{
try
{
cqueue.Enqueue(request);
semaphore.WaitOne();
Request newrequest;
cqueue.TryDequeue(out newrequest);
/* Section Begins */
var response = someHeavyJob(newrequest); // submitted to the server
return response;
/* Section Ends */
}
finally
{
semaphore.Release();
}
}
}
UPDATE:
I am clarifying my question:
The someHeavyJob() function is a blocking call to the server on which this job is being processed.
UPDATE2:
Thank you all for answers. For the record: I ended up using the FIFO Semaphore
'If I combine a Semaphore and a ConcurrentQueue as follows thread may come back with a response to the request brought by other thread what would require significant changes in other parts of code.'
I hate to say it, but I would suggest 'changes in other parts of code', even though I don't know how much 'significance' this would have.
Typically, such a requirement is met as you suggested, by queueing messages that contain a reference to the originating class instance so that responses can be 'returned' to the object that requested them. If the originators are all descended from some 'messagehandler' class, that makes it easier on the thread that will call the function (which should be a member of messagehandler). Once the thread/s have performed the function, they can call an 'onCompletion' method of the messagehandler. 'onCompletion' could either signal an event that the originator is waiting on (synchronous), or queue something to a private P-C queue of the originator (asynchronous).
So, a BlockingCollection, one consumer thread and judicious use of C++/C# inheritance/polymorphism should do the job.
Strangely, this is almost exactly what I am being forced into with my current embedded ARM project. The command-line interface thread used for config/debug/log is now so large that it needs a massive 600 words of stack, even in 'Thumb, Optimize of size' mode. It can no longer be permitted to call the SD filesystem directly and must now queue itself to the thread that runs the SD card, (which has the largest stack in the system to run FAT32), and wait on a semaphore for the SD thread to call its methods and signal the semaphore when done.
This is the classic way of ensuring that the calls are made sequentially and will certainly work. It's basically a threadpool with only one thread.
Like the other posters have written, any other approach is likely to be, err.. 'brave'.
also the order in which threads are calling someFunction() should be reflected in the order in which they are entering the section, that is FIFO order
This is not possible in principle.
semaphore.WaitOne(); //#1
var response = someHeavyJob(input); //#2
Even if Semaphore were strictly FIFO, the following could happen:
All threads enter the section in FIFO order (1)
All threads get descheduled from the CPU (between 1 and 2)
All threads get rescheduled in random order or even in LIFO order (between 1 and 2)
All threads start entering someHeavyJob in arbitrary order (2)
You can never ensure that the threads will "enter" the function in a specific order.
As for a FIFO semaphore, you can build a semaphore yourself using a lock and a Queue. It looks like you already did that and posted the code. This approach is correct as far as I can tell.
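For reference, a FIFO semaphore built from a lock and a queue can be sketched as below. This is Python rather than C#, FifoSemaphore is a hypothetical name, and it is not the code the asker posted. Each blocked caller waits on its own event, and release hands the permit directly to the longest-waiting caller, so wake-up order matches arrival order. As explained above, this still says nothing about the order in which threads are scheduled after they have been admitted.

```python
import threading
from collections import deque


class FifoSemaphore:
    def __init__(self, permits):
        self._lock = threading.Lock()
        self._permits = permits
        self._waiters = deque()   # one Event per blocked caller, oldest first

    def acquire(self):
        with self._lock:
            if self._permits > 0 and not self._waiters:
                self._permits -= 1
                return
            gate = threading.Event()
            self._waiters.append(gate)
        gate.wait()               # release() hands the permit over by setting the event

    def release(self):
        with self._lock:
            if self._waiters:
                # Pass the permit straight to the longest waiter; the count is unchanged.
                self._waiters.popleft().set()
            else:
                self._permits += 1
```

Because the permit is transferred inside release, a newly arriving thread cannot overtake a thread that is already queued.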
Have you looked at Smart Thread Pool?
[Edit]
If I am still getting the problem right, as I've stated in the comments, I don't believe that a multithreaded solution is feasible for this problem.
If a task k cannot be started before task k-1 has finished, then you only need a single thread to execute them. If you are allowed to execute some combinations of tasks in parallel, then you need to specify the rules exactly.

Limiting the number of threadpool threads

I am using ThreadPool in my application. I have first set the limit of the thread pool by using the following:
ThreadPool.SetMaxThreads(m_iThreadPoolLimit,m_iThreadPoolLimit);
m_Events = new ManualResetEvent(false);
and then I have queued up the jobs using the following
WaitCallback objWcb = new WaitCallback(abc);
ThreadPool.QueueUserWorkItem(objWcb, m_objThreadData);
Here abc is the name of the function that I am calling.
After this I am doing the following so that all my threads come to 1 point and the main thread takes over and continues further
m_Events.WaitOne();
My thread limit is 3. The problem I am facing is that, in spite of the thread pool limit being set to 3, my application is processing more than 3 files at the same time, whereas it was supposed to process only 3 files at a time. Please help me solve this issue.
What kind of computer are you using?
From MSDN
You cannot set the number of worker threads or the number of I/O completion threads to a number smaller than the number of processors in the computer.
If you have 4 cores, then the smallest you can have is 4.
Also note:
If the common language runtime is hosted, for example by Internet Information Services (IIS) or SQL Server, the host can limit or prevent changes to the thread pool size.
If this is a web site hosted by IIS then you cannot change the thread pool size either.
A better solution involves the use of a Semaphore, which can throttle concurrent access to a resource [1]. In your case the resource would simply be a block of code that processes work items.
var finished = new CountdownEvent(1); // Used to wait for the completion of all work items.
var throttle = new Semaphore(3, 3); // Used to throttle the processing of work items.
foreach (WorkItem item in workitems)
{
finished.AddCount();
WorkItem capture = item; // Needed to safely capture the loop variable.
ThreadPool.QueueUserWorkItem(
(state) =>
{
throttle.WaitOne();
try
{
ProcessWorkItem(capture);
}
finally
{
throttle.Release();
finished.Signal();
}
}, null);
}
finished.Signal();
finished.Wait();
In the code above WorkItem is a hypothetical class that encapsulates the specific parameters needed to process your tasks.
The Task Parallel Library makes this pattern a lot easier. Just use the Parallel.ForEach method and specify a ParallelOptions.MaxDegreeOfParallelism that throttles the concurrency.
var options = new ParallelOptions();
options.MaxDegreeOfParallelism = 3;
Parallel.ForEach(workitems, options,
(item) =>
{
ProcessWorkItem(item);
});
[1] I should point out that I do not like blocking ThreadPool threads using a Semaphore or any blocking device. It basically wastes the threads. You might want to rethink your design entirely.
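For comparison, here is how the same "at most three at a time" limit might be sketched in Python, where a dedicated pool with a fixed worker count acts as the throttle, so no pooled thread sits blocked on a semaphore. The process_all function and the ConcurrencyProbe helper are hypothetical names for this example; the probe exists only to show that the limit holds.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def process_all(work_items, process, limit=3):
    # The pool itself is the throttle: it never runs more than `limit` items at once,
    # and queued items do not occupy a blocked thread while they wait.
    with ThreadPoolExecutor(max_workers=limit) as pool:
        return list(pool.map(process, work_items))


class ConcurrencyProbe:
    """Hypothetical helper that records the peak number of simultaneous calls."""

    def __init__(self):
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def __call__(self, item):
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        time.sleep(0.01)  # stands in for processing one file
        with self._lock:
            self.active -= 1
        return item * 2
```

Passing a ConcurrencyProbe instance as the process argument and reading its peak attribute afterwards is a quick way to check whether more than three files were ever processed at once.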
You should use a Semaphore object to limit the number of concurrent threads.
You say the files are open: are they actually being actively processed, or just left open?
If you're leaving them open: been there, done that! Relying on connections and resources (it was a DB connection in my case) to close at end of scope should work, but it can take a while for the dispose / garbage collection to kick in.

Suggest an Improved approach for Multi-threaded application

I am building a multi-threaded application where I spawn three threads when the application starts, and these threads continue to run for the application's lifetime. All my threads are independent and do not interfere with each other in any way. Now a user can suspend the application, and at that point I want to suspend or, say, abort my threads.
I am currently spawning threads as foreground threads, but I guess changing them to background threads wouldn't affect my application anyway (except that foreground threads would keep the application alive until they finish).
I would ask people here to suggest an approach to suspend the application via Thread.Suspend() or Thread.Abort(). I know Thread.Suspend is obsolete and risky, but is it also harmful for my application, where I am not using any type of synchronization?
PS: My threads are saving and retrieving some data to & from embedded database(sqlite) every minute.
Use blocking mechanisms like WaitHandles (ManualResetEvent, AutoResetEvent), Monitor, Semaphore, etc.
Andrew
P.S. the question is quite broad so I would ultimately recommend reading up on proven practices and principles of Multi Threading which will include synchronization. Your requirements do not sound too complex so I am sure you will be able to research the best way which suits your needs.
You could create a mutex and let the threads wait for a signal on that mutex. This way your threads are not destroyed but they will sleep almost without consuming resources.
Mutex.WaitOne
I always use ManualResetEvent for this:
class Myclass
{
ManualResetEvent _event = new ManualResetEvent(false);
Thread _thread;
public void Start()
{
_thread = new Thread(WorkerThread);
_thread.IsBackground = true;
_thread.Start();
}
public void Stop()
{
_event.Set();
if (!_thread.Join(5000))
_thread.Abort();
}
private void WorkerThread()
{
while (true)
{
// wait 5 seconds, change to whatever you like
if (_event.WaitOne(5000))
break; // signalled to stop
//do something else here
}
}
}
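The same stop-signal pattern, sketched in Python for comparison: threading.Event plays the part of ManualResetEvent, and its timed wait doubles as an interruptible sleep. PeriodicWorker and its parameters are hypothetical names. Python offers no counterpart to Thread.Abort, so stop here only reports whether the thread finished in time.

```python
import threading


class PeriodicWorker:
    def __init__(self, action, interval):
        self._action = action        # hypothetical callable run once per interval
        self._interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self, timeout=5.0):
        self._stop.set()             # wakes the worker immediately if it is waiting
        self._thread.join(timeout)
        return not self._thread.is_alive()

    def _run(self):
        # Event.wait returns True as soon as stop() sets the event, so the
        # loop ends without waiting out the rest of the interval.
        while not self._stop.wait(self._interval):
            self._action()
```

For the once-a-minute SQLite work in the question, interval would be 60 and stop would return almost immediately, because the worker spends nearly all of its time inside the wait.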
Actually, this situation is the only one where Thread.Suspend does make sense. The reason it's obsoleted is because people misuse it to fake synchronization, or use it on threads they do not own (e.g., ThreadPool threads).
