I am building a web scraper in C# that deals with proxies and a large volume of requests. The pages are loaded through a ConnectionManager class that grabs a proxy and retries loading that page with random proxies until the page is correctly loaded.
On average, a single task will take somewhere between 100 and 300 requests, and to speed up the process, I have designed the method to use multithreading to simultaneously download the webpages.
public Review[] getReviewsMultithreaded(int reviewCount)
{
ArrayList reviewList = new ArrayList();
int currentIndex = 0;
int currentPage = 1;
int totalPages = (reviewCount / 10) + 1;
bool threadHasMoreWork = true;
Object pageLock = new Object();
Thread[] threads = new Thread[Program.maxScraperThreads];
for(int i = 0; i < Program.maxScraperThreads; i++)
{
threads[i] = (new Thread(() =>
{
while (threadHasMoreWork)
{
HtmlDocument doc;
lock(pageLock)
{
if (currentPage <= totalPages)
{
string builtString = "http://www.example.com/reviews/" + _ID + "?pageNumber=" + currentPage;
//Log.WriteLine(builtString);
currentPage++;
doc = Program.conManager.loadDocument(builtString);
}
else
{
threadHasMoreWork = false;
continue;
}
}
try
{
//Get info from page and add to list (parsing elided in the original post)
reviewList.Add(cRev);
Log.WriteLine(_asin + " reviews scraped: " + reviewList.Count);
}
catch (Exception ex) { continue; }
}
}));
threads[i].Start();
}
bool threadsAreRunning = true;
while(threadsAreRunning) //this is in a separate thread itself, so as not to interrupt the GUI
{
threadsAreRunning = false;
foreach (Thread t in threads)
if (t.IsAlive)
{
threadsAreRunning = true;
Thread.Sleep(2000);
}
}
//flatten the ArrayList to a typed array
Review[] reviewArray = (Review[])reviewList.ToArray(typeof(Review));
return reviewArray;
}
However, I have noticed that the requests are still largely being handled one at a time, and as a result the method isn't much faster than it was before. Is the lock causing problems? Or is it the fact that the ConnectionManager is instantiated once and every thread calls loadDocument on that same object?
Ah, never mind. I noticed that the lock included the call to the method that loads the pages, so only one page was loading at a time.
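For reference, here is a minimal sketch of that fix: only the page counter is touched under the lock, and the slow load runs outside it. LoadDocument below is a hypothetical stand-in for Program.conManager.loadDocument that just sleeps, and the thread and page counts are made-up values.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

class PageClaimSketch
{
    // Hypothetical stand-in for Program.conManager.loadDocument: simulates a slow request.
    static string LoadDocument(string url)
    {
        Thread.Sleep(200);
        return "<html>" + url + "</html>";
    }

    static void Main()
    {
        int totalPages = 20;   // assumed value for the sketch
        int currentPage = 1;
        object pageLock = new object();
        List<string> results = new List<string>();
        Thread[] threads = new Thread[5];

        for (int i = 0; i < threads.Length; i++)
        {
            threads[i] = new Thread(() =>
            {
                while (true)
                {
                    int page;
                    lock (pageLock)
                    {
                        // Only the cheap bookkeeping happens under the lock.
                        if (currentPage > totalPages) return;
                        page = currentPage++;
                    }

                    // The slow download runs outside the lock, so the threads overlap.
                    string doc = LoadDocument("http://www.example.com/reviews/?pageNumber=" + page);

                    lock (results) // List<T> is not thread-safe, so guard the shared list too
                    {
                        results.Add(doc);
                    }
                }
            });
            threads[i].Start();
        }

        foreach (Thread t in threads) t.Join(); // wait for all workers instead of polling IsAlive
        Console.WriteLine("Pages loaded: " + results.Count);
    }
}
```

With five threads and a simulated 200 ms load, the 20 pages should finish in roughly four load times rather than twenty.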
I am calling a VB 6.0 DLL inside Parallel.ForEach and expecting all calls to start simultaneously, or at least two of them, depending on my PC's cores or on thread availability in the thread pool.
VB6 dll
Public Function DoJunk(ByVal counter As Long, ByVal data As String) As Integer
Dim i As Long
Dim j As Long
Dim s As String
Dim fno As Integer
fno = FreeFile
Open "E:\JunkVB6Dll\" & data & ".txt" For Output Access Write As #fno
Print #fno, "Starting loop with counter = " & counter
For i = 0 To counter
Print #fno, "counting " & i
Next
Close #fno
DoJunk = 1
End Function
counter is passed in by the caller to control how long the call takes, and a file is written to make the call I/O-bound.
C# caller
private void ReportProgress(int value)
{
progressBar.Value = value;
//progressBar.Value++;
}
private void button1_Click(object sender, EventArgs e)
{
progressBar.Value = 0;
counter = 0;
Stopwatch watch = new Stopwatch();
watch.Start();
//var range = Enumerable.Range(0, 100);
var range = Enumerable.Range(0, 20);
bool finished = false;
Task.Factory.StartNew(() =>
{
Parallel.ForEach(range, i =>
{
#region COM CALL
JunkProject.JunkClass junk = new JunkProject.JunkClass();
try
{
Random rnd = new Random();
int dice = rnd.Next(10, 40);
int val = 0;
if (i == 2)
val = junk.DoJunk(9000000, i.ToString());
else
val = junk.DoJunk(dice * 10000, i.ToString());
System.Diagnostics.Debug.Print(junk.GetHashCode().ToString());
if (val == 1)
{
Interlocked.Increment(ref counter);
progressBar.Invoke((Action)delegate { ReportProgress(counter); });
}
junk = null;
}
catch (Exception excep)
{
i = i;
}
finally { junk = null; }
#endregion
});
}).ContinueWith(t =>
{
watch.Stop();
MessageBox.Show(watch.ElapsedMilliseconds.ToString());
});
}
This line makes one specific call take longer than the others:
val = junk.DoJunk(9000000, i.ToString());
This long-running call causes all other calls inside the Parallel.ForEach to stop, i.e. no other file is created until it completes.
Is this expected behavior, or am I doing something wrong?
As @John Wu suggested, you can create an AppDomain so that each COM call runs in a different application domain. I believe you could run your parallel loop like this:
Parallel.ForEach(range, i =>
{
AppDomain otherDomain = AppDomain.CreateDomain(i.ToString());
otherDomain.DoCallBack(delegate
{
//Your COM call
});
});
EDIT
Right, I am not sure how you can mark a VB 6.0 class as serializable. You can try the other way (marshaling objects by reference). Note: I haven't actually tested this, but I would like to know whether it works.
Parallel.ForEach(range, i =>
{
AppDomain otherDomain = AppDomain.CreateDomain(i.ToString());
var comCall = (ComCall) otherDomain.CreateInstanceFromAndUnwrap(Assembly.GetExecutingAssembly().Location, typeof(ComCall).ToString());
comCall.Run();
AppDomain.Unload(otherDomain);
});
and the class
public class ComCall : MarshalByRefObject
{
public void Run()
{
//Your COM Call
}
}
Here is an additional reference on the topic:
https://www.codeproject.com/Articles/14791/NET-Remoting-with-an-easy-example
I'm trying to make a tool that fetches the source string from many URLs that I provide. I use this code for multithreading:
new Thread(() =>
{
while (stop != true)
{
if (nowworker >= threads)
{
Thread.Sleep(50);
}
else
{
if (i <= urllist.Count - 1)
{
var thread = new Thread(() =>
{
string source = GetSource(urllist[i]);
SaveToFile(source, i + ".txt");
});
thread.Start();
i++;
nowworker += 1;
}
else
{
stop = true;
}
}
}
}).Start();
It runs very smoothly, but when I check the results, some are duplicated and some of the URLs I provided are missing. This happens when I use fewer threads than URLs (10 threads for 20 URLs); there is no problem when I use 20 threads for 20 URLs.
Please help me. Thank you.
if (i <= urllist.Count - 1)
{
var thread = new Thread(() =>
{
string source = GetSource(urllist[i]);
SaveToFile(source, i + ".txt");
});
thread.Start();
i++;
nowworker += 1;
}
The method you're passing to the thread is not guaranteed to execute before i is updated (the i++). In fact, it's very unlikely that it will. This means that multiple threads may use the same value of i, and some values of i will not have any thread processing them.
Even worse, GetSource may see a different value of i than SaveToFile.
Have a read here: http://jonskeet.uk/csharp/csharp2/delegates.html
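The capture behaviour described above can be reproduced without any threads. This is a small illustrative sketch (the names shared and copied are made up for the example): delegates that capture the loop variable i all observe its final value, while delegates that capture a per-iteration copy keep their own value.

```csharp
using System;
using System.Collections.Generic;

class ClosureCaptureSketch
{
    static void Main()
    {
        List<Func<int>> shared = new List<Func<int>>();
        List<Func<int>> copied = new List<Func<int>>();

        for (int i = 0; i < 3; i++)
        {
            // Captures the variable i itself; every delegate sees its latest value.
            shared.Add(() => i);

            // Captures a fresh local per iteration, so the value is frozen.
            int currentIndex = i;
            copied.Add(() => currentIndex);
        }

        // The loop has finished, so i is now 3 for every delegate in "shared".
        foreach (Func<int> f in shared) Console.Write(f() + " ");
        Console.WriteLine();

        // Each delegate in "copied" still holds the value from its own iteration.
        foreach (Func<int> f in copied) Console.Write(f() + " ");
        Console.WriteLine();
    }
}
```

The first line printed is "3 3 3" and the second is "0 1 2". With threads the timing is less predictable, but the cause is the same shared variable.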
This will fix it:
if (i <= urllist.Count - 1)
{
var currentIndex = i;
var thread = new Thread(() =>
{
string source = GetSource(urllist[currentIndex]);
SaveToFile(source, currentIndex + ".txt");
});
thread.Start();
i++;
nowworker += 1;
}
Even better, you can replace the entire block of code with this:
Parallel.For(0, urllist.Count, // the upper bound is exclusive
new ParallelOptions { MaxDegreeOfParallelism = threads },
i =>
{
string source = GetSource(urllist[i]);
SaveToFile(source, i + ".txt");
}
);
This gets rid of the code-smelly Thread.Sleep() and lets .NET manage spinning up the threads for you.
This is further to my question here
By doing some reading .... I moved away from Semaphores to ThreadPool.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading;
namespace ThreadPoolTest
{
class Data
{
public int Pos { get; set; }
public int Num { get; set; }
}
class Program
{
static ManualResetEvent[] resetEvents = new ManualResetEvent[20];
static void Main(string[] args)
{
int s = 0;
for (int i = 0; i < 100000; i++)
{
resetEvents[s] = new ManualResetEvent(false);
Data d = new Data();
d.Pos = s;
d.Num = i;
ThreadPool.QueueUserWorkItem(new WaitCallback(Process), (object)d);
if (s >= 19)
{
WaitHandle.WaitAll(resetEvents);
Console.WriteLine("Press Enter to Move forward");
Console.ReadLine();
s = 0;
}
else
{
s = s + 1;
}
}
}
private static void Process(object o)
{
Data d = (Data) o;
Console.WriteLine(d.Num.ToString());
Thread.Sleep(10000);
resetEvents[d.Pos].Set();
}
}
}
This code works, and I am able to process in sets of 20. But I don't like it because of WaitAll. Say I start a batch of 20, and 3 threads take longer while 17 have finished. The 17 finished slots still sit idle until the slow 3 complete, because of the WaitAll.
WaitAny would have been good, but it seems rather messy that I would have to build so many control structures (stacks, lists, queues, etc.) in order to use the pool efficiently.
The other thing I don't like is the global variable in the class for resetEvents, which is there because this array has to be shared between the Process method and the main loop.
The above code works... but I need your help in improving it.
Again... I am on .NET 2.0 VS 2008. I cannot use .NET 4.0 parallel/async framework.
There are several ways you can do this. Probably the easiest, based on what you've posted above, would be:
const int MaxThreads = 4;
const int ItemsToProcess = 10000;
private Semaphore _sem = new Semaphore(MaxThreads, MaxThreads);
void DoTheWork()
{
int s = 0;
for (int i = 0; i < ItemsToProcess; ++i)
{
_sem.WaitOne();
Data d = new Data();
d.Pos = s;
d.Num = i;
ThreadPool.QueueUserWorkItem(Process, d);
++s;
if (s >= 19)
s = 0;
}
// All items have been assigned threads.
// Now, acquire the semaphore "MaxThreads" times.
// When counter reaches that number, we know all threads are done.
int semCount = 0;
while (semCount < MaxThreads)
{
_sem.WaitOne();
++semCount;
}
// All items are processed
// Clear the semaphore for next time.
_sem.Release(semCount);
}
void Process(object o)
{
// do the processing ...
// release the semaphore
_sem.Release();
}
I only used four threads in my example because that's how many cores I have. It makes little sense to be using 20 threads when only four of them can be processing at any one time. But you're free to increase the MaxThreads number if you like.
So I'm pretty sure this is all .NET 2.0.
We'll start out defining Action, because I'm so used to using it. If using this solution in 3.5+, remove that definition.
Next, we create a queue of actions based on the input.
After that we define a callback; this callback is the meat of the method.
It first grabs the next item in the queue (using a lock since the queue isn't thread safe). If it ended up having an item to grab it executes that item. Next it adds a new item to the thread pool which is "itself". This is a recursive anonymous method (you don't come across uses of that all that often). This means that when the callback is called for the first time it will execute one item, then schedule a task which will execute another item, and that item will schedule a task that executes another item, and so on. Eventually the queue will run out, and they'll stop queuing more items.
We also want the method to block until we're all done, so for that we keep track of how many of these callbacks have finished through incrementing a counter. When that counter reaches the task limit we signal the event.
Finally we start N of these callbacks in the thread pool.
public delegate void Action();
public static void Execute(IEnumerable<Action> actions, int maxConcurrentItems)
{
object key = new object();
Queue<Action> queue = new Queue<Action>(actions);
int count = 0;
AutoResetEvent whenDone = new AutoResetEvent(false);
WaitCallback callback = null;
callback = delegate
{
Action action = null;
lock (key)
{
if (queue.Count > 0)
action = queue.Dequeue();
}
if (action != null)
{
action();
ThreadPool.QueueUserWorkItem(callback);
}
else
{
if (Interlocked.Increment(ref count) == maxConcurrentItems)
whenDone.Set();
}
};
for (int i = 0; i < maxConcurrentItems; i++)
{
ThreadPool.QueueUserWorkItem(callback);
}
whenDone.WaitOne();
}
Here's another option that doesn't use the thread pool, and just uses a fixed number of threads:
public static void Execute(IEnumerable<Action> actions, int maxConcurrentItems)
{
Thread[] threads = new Thread[maxConcurrentItems];
object key = new object();
Queue<Action> queue = new Queue<Action>(actions);
for (int i = 0; i < maxConcurrentItems; i++)
{
threads[i] = new Thread(new ThreadStart(delegate
{
Action action = null;
do
{
lock (key)
{
if (queue.Count > 0)
action = queue.Dequeue();
else
action = null;
}
if (action != null)
{
action();
}
} while (action != null);
}));
threads[i].Start();
}
for (int i = 0; i < maxConcurrentItems; i++)
{
threads[i].Join();
}
}
I recently started a new job, and there is a Windows service here that consumes messages from a private Windows queue. The service consumes messages only from 9am to 6pm, so from 7pm to 8:59am a lot of messages accumulate on the queue. When it starts processing at 9am, the service's CPU usage goes very high (98 to 99 percent), hurting the server's performance.
This service uses threads to process the messages in the queue, but as I have never worked with threads before, I am a little lost.
Here's the part of the code where I am sure this is happening:
private Thread[] th;
//in the constructor of the class, the variable th is initialized like this:
this.th = new Thread[4];
//the interval of this method calling is 1sec, it only goes high cpu usage when there is a lot of messages in the queue
public void Exec()
{
try
{
AutoResetEvent autoEvent = new AutoResetEvent(false);
int vQtd = queue.GetAllMessages().Length;
while (vQtd > 0)
{
for (int y = 0; y < th.Length; y++)
{
if (this.th[y] == null || !this.th[y].IsAlive)
{
this.th[y] = new Thread(new ParameterizedThreadStart(ProcessMessage));
this.th[y].Name = string.Format("Thread_{0}", y);
this.th[y].Start(new Controller(queue.Receive(), autoEvent));
vQtd--;
}
}
}
}
catch (Exception ex)
{
ExceptionPolicy.HandleException(ex, "RECOVERABLE");
}
}
EDIT: I am trying the second approach posted by Brian Gideon. But I'll be honest: I'm deeply confused by the code and I don't have a clue what it's doing.
I haven't changed the way the 4 threads are created or the other code I showed; I just changed my Exec method (Exec is the method called every second between 9am and 6pm) to this:
public void Exec()
{
try
{
AutoResetEvent autoEvent = new AutoResetEvent(false);
int vQtd = queue.GetAllMessages().Length;
while (vQtd > 0)
{
for (int i = 0; i < 4; i++)
{
var thread = new Thread(
(ProcessMessage) =>
{
while (true)
{
Message message = queue.Receive();
Controller controller = new Controller(message, autoEvent);
//what am I supposed to do with the controller?
}
});
thread.IsBackground = true;
thread.Start();
}
vQtd--;
}
}
catch (Exception ex)
{
ExceptionPolicy.HandleException(ex, "RECOVERABLE");
}
}
Ouch. I have to be honest. That is not a very good design. It could very well be spinning around that while loop waiting for previous threads to finish processing. Here is a much better way of doing it. Notice that the 4 threads are only created once and hang around forever. The code below uses the BlockingCollection from the .NET 4.0 BCL. If you are using an earlier version you can replace it with Stephen Toub's BlockingQueue.
Note: Further refactoring may be warranted in your case. This code tries to preserve some common elements from the original.
public class Example
{
private BlockingCollection<Controller> m_Queue = new BlockingCollection<Controller>();
public Example()
{
for (int i = 0; i < 4; i++)
{
var thread = new Thread(
() =>
{
while (true)
{
Controller controller = m_Queue.Take();
// Do whatever you need to with Controller here.
}
});
thread.IsBackground = true;
thread.Start();
}
}
public void Exec()
{
try
{
AutoResetEvent autoEvent = new AutoResetEvent(false);
int vQtd = Queue.GetAllMessages().Length;
while (vQtd > 0)
{
m_Queue.Add(new Controller(Queue.Receive(), autoEvent));
vQtd--; // without this the loop never ends
}
}
catch (Exception ex)
{
ExceptionPolicy.HandleException(ex, "RECOVERABLE");
}
}
}
Edit:
Or better yet since MessageQueue is thread-safe:
public class Example
{
public Example()
{
for (int i = 0; i < 4; i++)
{
var thread = new Thread(
() =>
{
while (true)
{
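if (DateTime.Now.Hour >= 9 && DateTime.Now.Hour < 18) // between 9am and 6pm
{
Message message = queue.Receive();
Controller controller = new Controller(message, /* AutoResetEvent? */);
// Do whatever you need to with Controller here.
// Is the AutoResetEvent really needed?
}
else
{
Thread.Sleep(1000); // avoid spinning outside working hours
}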
}
});
thread.IsBackground = true;
thread.Start();
}
}
}
The method you show runs in a tight loop when all threads are busy. Try something like this:
while (vQtd > 0)
{
bool full = true;
for (int y = 0; y < th.Length; y++)
{
if (this.th[y] == null || !this.th[y].IsAlive)
{
this.th[y] = new Thread(new ParameterizedThreadStart(ProcessMessage));
this.th[y].Name = string.Format("Thread_{0}", y);
this.th[y].Start(new Controller(queue.Receive(), autoEvent));
vQtd--;
full = false;
}
}
if (full)
{
Thread.Sleep(500); // Or whatever it may take for a thread to become free.
}
}
You have two options. Either you insert delays after each message with Thread.Sleep() or lower the thread priority of the polling threads. If you lower the thread priority the CPU usage will still be high, but should not affect performance that much.
Edit: or you can lower the number of threads from 4 to 3 to leave one core for other processing (assuming you have a quad core). This of course reduces your dequeuing throughput.
Edit2: or you could rewrite the whole thing with the Task Parallel Library if you are running .NET 4. Look for Parallel.ForEach(). That should save you some of the footwork if you are not familiar with threads.
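As a rough illustration of the first two options (a pause after each message and a lowered thread priority), here is a minimal sketch. It assumes a plain in-memory Queue<string> as a hypothetical stand-in for the real message queue, and the 10 ms pause is an arbitrary value, not a recommendation.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

class LowPriorityWorkerSketch
{
    static void Main()
    {
        // Hypothetical in-memory stand-in for the message queue.
        Queue<string> messages = new Queue<string>();
        for (int i = 0; i < 10; i++) messages.Enqueue("message " + i);

        int processed = 0;
        Thread worker = new Thread(() =>
        {
            while (true)
            {
                string message;
                lock (messages)
                {
                    if (messages.Count == 0) return; // nothing left, end the worker
                    message = messages.Dequeue();
                }

                // ... process the message here ...
                processed++;

                Thread.Sleep(10); // option 1: a short pause after each message
            }
        });

        // Option 2: let normal-priority threads be scheduled ahead of this one.
        worker.Priority = ThreadPriority.BelowNormal;
        worker.IsBackground = true;
        worker.Start();
        worker.Join();

        Console.WriteLine("Processed: " + processed);
    }
}
```

The two options are independent; either line can be removed to try the other on its own.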
There is a string array myDownloadList containing 100 string URIs. I want to start 5 thread jobs that each pop the next URI from myDownloadList (like a stack) and do something with it (download it), until there are no URIs left on the stack (myDownloadList).
What would be the best practice to do this?
Use the ThreadPool, and just setup all of your requests. The ThreadPool will automatically schedule them appropriately.
This will get easier with .NET 4, using the Task Parallel Library. Setting up each request as a Task is very efficient and easy.
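A minimal sketch of the Task Parallel Library approach, assuming .NET 4 or later. Download is a hypothetical placeholder for the real work, the URIs are made up, and MaxDegreeOfParallelism = 5 is one way to cap the work at five concurrent jobs; it is not the only way to do this.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

class DownloadListSketch
{
    // Hypothetical stand-in for the real download; returns the length of the URI.
    static int Download(string uri)
    {
        Thread.Sleep(20); // simulate network latency
        return uri.Length;
    }

    static void Main()
    {
        string[] myDownloadList = new string[100];
        for (int i = 0; i < myDownloadList.Length; i++)
            myDownloadList[i] = "http://www.example.com/file" + i;

        // Thread-safe collection for results produced by several workers.
        ConcurrentBag<int> results = new ConcurrentBag<int>();
        ParallelOptions options = new ParallelOptions { MaxDegreeOfParallelism = 5 };

        // At most five URIs are processed at the same time; each URI is handled exactly once.
        Parallel.ForEach(myDownloadList, options, uri => results.Add(Download(uri)));

        Console.WriteLine("Downloaded: " + results.Count);
    }
}
```

Parallel.ForEach blocks until every item has been processed, so no explicit waiting or locking of the list is needed in this version.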
Make sure each thread locks myDownloadList when accessing it. You could use recursion to keep taking the next item; when the list is empty, the function can simply stop.
See the example below.
public static List<string> MyList { get; set; }
public static object LockObject { get; set; }
static void Main(string[] args)
{
Console.Clear();
Program.LockObject = new object();
// Create the list
Program.MyList = new List<string>();
// Add 100 items to it
for (int i = 0; i < 100; i++)
{
Program.MyList.Add(string.Format("Item Number = {0}", i));
}
// Start Threads
for (int i = 0; i < 5; i++)
{
Thread thread = new Thread(new ThreadStart(Program.PopItemFromStackAndPrint));
thread.Name = string.Format("Thread # {0}", i);
thread.Start();
}
}
public static void PopItemFromStackAndPrint()
{
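string item = string.Empty;
lock (Program.LockObject)
{
// Check the count inside the lock; otherwise another thread could remove
// the last item between the check and the removal.
if (Program.MyList.Count == 0)
{
return;
}
// Get first Item
item = Program.MyList[0];
Program.MyList.RemoveAt(0);
}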
Console.WriteLine("{0}:{1}", System.Threading.Thread.CurrentThread.Name, item);
// Sleep to show other processing for examples only
System.Threading.Thread.Sleep(10);
Program.PopItemFromStackAndPrint();
}