Multi-thread C# queue in .Net 4 - c#

I'm developing a simple crawler for web pages. I've searched and found a lot of solutions for implementing multi-threaded crawlers. What is the best way to create a thread-safe queue to contain unique URLs?
EDIT:
Is there a better solution in .NET 4.5?

Use the Task Parallel Library and use the default scheduler which uses ThreadPool.
OK, this is a minimal implementation which runs at most 50 crawl tasks at a time:
public static void WebCrawl(Func<string> getNextUrlToCrawl, // returns a URL or null if no more URLs
    Action<string> crawlUrl, // action to crawl the URL
    int pauseInMilli // if all slots are taken, waits for n milliseconds
    )
{
    const int maxQueueLength = 50;
    string currentUrl = null;
    int queueLength = 0;
    while ((currentUrl = getNextUrlToCrawl()) != null)
    {
        string temp = currentUrl;
        // Wait for a free slot here; sleeping in an else branch would drop the current URL.
        while (Thread.VolatileRead(ref queueLength) >= maxQueueLength)
        {
            Thread.Sleep(pauseInMilli);
        }
        // Count the task before it starts so the limit cannot be overshot.
        Interlocked.Increment(ref queueLength);
        Task.Factory.StartNew(() => crawlUrl(temp))
            .ContinueWith((t) =>
            {
                if (t.IsFaulted)
                    Console.WriteLine(t.Exception.ToString());
                else
                    Console.WriteLine("Successfully done!");
                Interlocked.Decrement(ref queueLength);
            });
    }
}
Dummy usage:
static void Main(string[] args)
{
Random r = new Random();
int i = 0;
WebCrawl(() => (i = r.Next()) % 100 == 0 ? null : ("Some URL: " + i.ToString()),
(url) => Console.WriteLine(url),
500);
Console.Read();
}

ConcurrentQueue is indeed the framework's thread-safe queue implementation. But since you're likely to use it in a producer-consumer scenario, the class you're really after may be the infinitely useful BlockingCollection.

Would System.Collections.Concurrent.ConcurrentQueue<T> fit the bill?

I'd use System.Collections.Concurrent.ConcurrentQueue.
You can safely queue and dequeue from multiple threads.

Look at System.Collections.Concurrent.ConcurrentQueue. If you need to wait, you could use System.Collections.Concurrent.BlockingCollection
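The answers above cover the thread-safe queue but not the "unique URLs" part of the question. Below is a minimal sketch of one way to combine the two, assuming a hypothetical UniqueUrlQueue class (the name and its members are made up for illustration, they are not framework types). It wraps a BlockingCollection over a ConcurrentQueue and uses a ConcurrentDictionary as a "seen" set, so a URL is only queued the first time it is offered.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

// Hypothetical helper, not part of the framework: a blocking queue that
// rejects URLs it has already accepted once.
public sealed class UniqueUrlQueue
{
    private readonly BlockingCollection<string> _queue =
        new BlockingCollection<string>(new ConcurrentQueue<string>());

    // ConcurrentDictionary stands in for a concurrent set; the byte value is unused.
    private readonly ConcurrentDictionary<string, byte> _seen =
        new ConcurrentDictionary<string, byte>();

    // Returns true if the URL was new and has been queued, false if it was a duplicate.
    public bool TryEnqueue(string url)
    {
        if (!_seen.TryAdd(url, 0)) return false;
        _queue.Add(url);
        return true;
    }

    // Signals that no more URLs will be added, so consumers can finish.
    public void CompleteAdding()
    {
        _queue.CompleteAdding();
    }

    // Blocks while the queue is empty; ends once CompleteAdding was called
    // and the remaining items have been consumed.
    public IEnumerable<string> GetConsumingEnumerable()
    {
        return _queue.GetConsumingEnumerable();
    }
}
```

Each worker would loop with foreach (string url in queue.GetConsumingEnumerable()) and call TryEnqueue for every link it discovers. The comparison here is the default ordinal string comparison, so URLs that differ only in case or a trailing slash count as different; normalizing them first is left out of this sketch.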

Related

Limiting number of API requests started per second using Parallel.ForEach

I am working on improving some of my code to increase efficiency. In the original code I was limiting the number of threads allowed to be 5, and if I had already 5 active threads I would wait until one finished before starting another one. Now I want to modify this code to allow any number of threads, but I want to be able to make sure that only 5 threads get started every second. For example:
Second 0 - 5 new threads
Second 1 - 5 new threads
Second 2 - 5 new threads ...
Original Code (cleanseDictionary contains usually thousands of items):
ConcurrentDictionary<long, APIResponse> cleanseDictionary = new ConcurrentDictionary<long, APIResponse>();
ConcurrentBag<int> itemsinsec = new ConcurrentBag<int>();
ConcurrentDictionary<long, string> resourceDictionary = new ConcurrentDictionary<long, string>();
DateTime start = DateTime.Now;
Parallel.ForEach(resourceDictionary, new ParallelOptions { MaxDegreeOfParallelism = 5 }, row =>
{
lock (itemsinsec)
{
ThrottleAPIRequests(itemsinsec, start);
itemsinsec.Add(1);
}
cleanseDictionary.TryAdd(row.Key, _helper.MakeAPIRequest(string.Format("/endpoint?{0}", row.Value)));
});
private static void ThrottleAPIRequests(ConcurrentBag<int> itemsinsec, DateTime start)
{
if ((start - DateTime.Now).Milliseconds < 10001 && itemsinsec.Count > 4)
{
System.Threading.Thread.Sleep(1000 - (start - DateTime.Now).Milliseconds);
start = DateTime.Now;
itemsinsec = new ConcurrentBag<int>();
}
}
My first thought was to increase MaxDegreeOfParallelism to something much higher and then have a helper method that limits it to only 5 threads per second, but I am not sure if that is the best way to do it, and if it is, I would probably need a lock around that step?
Thanks in advance!
EDIT
I am actually looking for a way to throttle the API requests rather than the actual threads. I was thinking they were one and the same.
Edit 2: My requirements are to send over 5 API requests every second
The Microsoft documentation for Parallel.ForEach says only that iterations "may run in parallel".
If you want any degree of fine control over how the threads are managed, this is not the way.
How about creating your own helper class where you can queue jobs with a group id, allows you to wait for all jobs of group id X to complete, and it spawns extra threads as and when required?
For me the best solution is:
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;
namespace SomeNamespace
{
public class RequestLimiter : IRequestLimiter
{
private readonly ConcurrentQueue<DateTime> _requestTimes;
private readonly TimeSpan _timeSpan;
private readonly object _locker = new object();
public RequestLimiter()
{
_timeSpan = TimeSpan.FromSeconds(1);
_requestTimes = new ConcurrentQueue<DateTime>();
}
public TResult Run<TResult>(int requestsOnSecond, Func<TResult> function)
{
WaitUntilRequestCanBeMade(requestsOnSecond).Wait();
return function();
}
private Task WaitUntilRequestCanBeMade(int requestsOnSecond)
{
return Task.Factory.StartNew(() =>
{
while (!TryEnqueueRequest(requestsOnSecond).Result) ;
});
}
private Task SynchronizeQueue()
{
return Task.Factory.StartNew(() =>
{
// Re-peek on every pass so each timestamp is checked against its own age.
while (_requestTimes.TryPeek(out var first) && first.Add(_timeSpan) < DateTime.UtcNow)
_requestTimes.TryDequeue(out _);
});
}
private Task<bool> TryEnqueueRequest(int requestsOnSecond)
{
lock (_locker)
{
SynchronizeQueue().Wait();
if (_requestTimes.Count < requestsOnSecond)
{
_requestTimes.Enqueue(DateTime.UtcNow);
return Task.FromResult(true);
}
return Task.FromResult(false);
}
}
}
}
I want to be able to send over 5 API request every second
That's really easy:
while (true) {
await Task.Delay(TimeSpan.FromSeconds(1));
await Task.WhenAll(Enumerable.Range(0, 5).Select(_ => RunRequestAsync()));
}
Maybe not the best approach, since the requests go out in bursts rather than continuously.
Also, there is timing skew: one iteration takes more than 1 second, because the time spent on the requests is added to the delay. This can be solved with a few lines of time logic.
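The "few lines of time logic" could look roughly like this. It is only a sketch: PacedSender, RunAsync and the runRequestAsync delegate are hypothetical names, and the fixed number of batches is there just to make the loop finite. Each batch is scheduled relative to a single Stopwatch instead of sleeping a full second after the previous batch, so the time the requests take no longer accumulates as skew (as long as a batch finishes within the interval).

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class PacedSender
{
    // Starts `perBatch` requests per `interval`, for `batches` batches.
    // Batch n is due at n * interval measured from one stopwatch, so the
    // duration of a batch does not push later batches back unless the
    // batch itself runs longer than the interval.
    public static async Task RunAsync(
        Func<Task> runRequestAsync, int perBatch, int batches,
        TimeSpan interval, CancellationToken token = default(CancellationToken))
    {
        var clock = Stopwatch.StartNew();
        for (int n = 0; n < batches; n++)
        {
            TimeSpan due = TimeSpan.FromTicks(interval.Ticks * n);
            TimeSpan wait = due - clock.Elapsed;
            if (wait > TimeSpan.Zero)
                await Task.Delay(wait, token);

            // Same fan-out as in the answer above: start the batch and wait for it.
            await Task.WhenAll(Enumerable.Range(0, perBatch).Select(_ => runRequestAsync()));
        }
    }
}
```

A call such as await PacedSender.RunAsync(RunRequestAsync, 5, 100, TimeSpan.FromSeconds(1)) would send 100 batches of 5. The bursts are still there; spreading the 5 requests evenly across the second would need a per-request delay instead.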

Combining Thread with parameters and return value

I have been working with multithreading lately, and since I'm new to this I'm probably doing something basic wrong.
Thread mainthread = new Thread(() => threadmain("string", "string", "string"));
mainthread.Start();
The above code works flawlessly, but now I want to get a value back from my thread.
To do that I searched on SO and found this code:
object value = null;
var thread = new Thread(
() =>
{
value = "Hello World";
});
thread.Start();
thread.Join();
MessageBox.Show((string)value);
I don't know how to combine the two.
The return value will be a string.
Thank you for helping a newbie. I tried combining them but got errors due to my lack of experience.
edit:
my thread:
public void threadmain(string url,string search, string regexstring)
{
using (WebClient client = new WebClient()) // WebClient class inherits IDisposable
{
string allthreadusernames = "";
string htmlCode = client.DownloadString(url);
string[] htmlarray = htmlCode.Split(new string[] { "\n", "\r\n" }, StringSplitOptions.RemoveEmptyEntries);
foreach (string line in htmlarray)
{
if (line.Contains(search))
{
var regex = new Regex(regexstring);
var matches = regex.Matches(line);
foreach (var singleuser in matches.Cast<Match>().ToList())
{
allthreadusernames = allthreadusernames + "\n" + singleuser.Groups[1].Value;
}
}
}
MessageBox.Show(allthreadusernames);
}
}
An easy solution would be to use another level of abstraction for asynchronous operations: Tasks.
Example:
public static int Calculate()
{
// Simulate some work
int sum = 0;
for (int i = 0; i < 10000; i++)
{
sum += i;
}
return sum;
}
// ...
var task = System.Threading.Tasks.Task.Run(() => Calculate());
int result = task.Result; // waits/blocks until the task is finished
In addition to task.Result, you can also wait for the task with await task (async/await pattern) or task.Wait (+ timeout and/or cancellation token).
Threads aren't really supposed to behave like functions. The code you found still lacks synchronization/thread-safety of reading/writing the output variable.
Task Parallel Library provides a better abstraction, Tasks.
Your problem can then be solved by code similar to this:
var result = await Task.Run(() => MethodReturningAValue());
Running tasks like this is actually more lightweight, as Task.Run queues the work to the .NET thread pool and reuses an existing thread instead of creating a dedicated one.
I highly recommend Stephen Cleary's blog series about using tasks for parallelism and asynchronicity. It should answer all your further questions.
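Applied to the question's threadmain method, the idea could be sketched as follows. The class and method names (UsernameExtractor, Extract, ExtractAsync) are hypothetical, and the sketch assumes the HTML has already been downloaded and is passed in as a string, so that it stays free of network calls. The point is only that the method returns the string instead of showing it, and the Task carries that return value back to the caller.

```csharp
using System;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

public static class UsernameExtractor
{
    // Same parsing as threadmain, but the result is returned instead of shown.
    public static string Extract(string htmlCode, string search, string regexstring)
    {
        var regex = new Regex(regexstring);
        var result = new StringBuilder();
        string[] lines = htmlCode.Split(
            new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);

        foreach (string line in lines.Where(l => l.Contains(search)))
        {
            foreach (Match match in regex.Matches(line))
                result.Append("\n").Append(match.Groups[1].Value);
        }
        return result.ToString();
    }

    // Runs the extraction on a thread-pool thread; the Task carries the string back.
    public static Task<string> ExtractAsync(string htmlCode, string search, string regexstring)
    {
        return Task.Run(() => Extract(htmlCode, search, regexstring));
    }
}
```

From an async method the caller would write string names = await UsernameExtractor.ExtractAsync(html, search, pattern); reading .Result instead blocks the calling thread, much like thread.Join() in the snippet from the question.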

C# RX (System.Reactive) - Async - Publish an IEnumerable<DataRow> to multiple observing data handlers

I'm new to RX.
I'd like to traverse an IEnumerable and publish to multiple DataHandlers that process the data in their respective threads.
Below is my sample program. The publish works and a new thread is created, but the 3 RowHandlers are all running in 1 thread. I need 3 threads. What is the best way to implement this?
class Program
{
public class MyDataGenerator
{
public IEnumerable<int> myData()
{
//Heavy lifting....Don't want to process more than once.
yield return 1;
yield return 2;
yield return 3;
yield return 4;
yield return 5;
yield return 6;
}
}
static void Main(string[] args)
{
MyDataGenerator h = new MyDataGenerator();
Console.WriteLine("Thread id " + Thread.CurrentThread.ManagedThreadId.ToString());
//
var shared = h.myData().ToObservable().Publish();
///////////////////////////////
// Row Handling Requirements
//
// 1. Single Scan of IEnumerable.
// 2. Row handlers process data in their own threads.
// 3. OK if scanning thread blocks while data is processed
//
//Create the RowHandlers
MyRowHandler rn1 = new MyRowHandler();
rn1.ido = shared.Subscribe(i => rn1.processID(i));
MyRowHandler rn2 = new MyRowHandler();
rn2.ido = shared.Subscribe(i => rn2.processID(i));
MyRowHandler rn3 = new MyRowHandler();
rn3.ido = shared.Subscribe(i => rn3.processID(i));
//
shared.Connect();
}
public class MyRowHandler
{
public IDisposable ido = null;
public void processID(int i)
{
var o = Observable.Start(() =>
{
Console.WriteLine(String.Format("Start Thread ID {0} Int{1}", Thread.CurrentThread.ManagedThreadId, i));
Thread.Sleep(30);
Console.WriteLine("Done Thread ID"+Thread.CurrentThread.ManagedThreadId.ToString());
}
);
o.First();
}
}
}
Discovery :
The gains in coding speed and code quality that Rx provides come at the expense of performance. Tasks and delegates are without a doubt many times faster. That means the most important thing to learn about Rx is when to use it. Below is a draft summary guideline. For large volumes I can see a use for Rx in chunking, combining, and other many-stream, many-handler models; however, basic async work should not use Rx.
I'd post an image with a matrix guideline, but the site won't let me post images
If I understand your sequencing requirements correctly and you want three parallel running scans, you can just observe on the TaskPool and subscribe from there;
...
//Create the RowHandlers
MyRowHandler rn1 = new MyRowHandler();
rn1.ido = shared.ObserveOn(Scheduler.TaskPool).Subscribe(i => rn1.processID(i));
...
Note that since you're then running asynchronously and your main thread doesn't wait for the scans to get done, your program will terminate right away unless you for example put a Console.ReadKey() at the end of the program.
EDIT: Regarding running the same thread "all the way", you're scheduling a bit strangely for that. If you drop the observable in the rowhandler, you can use Scheduler.NewThread and get good results;
...
var rowHandler1 = new MyRowHandler();
rowHandler1.ido = shared.ObserveOn(Scheduler.NewThread).Subscribe(rowHandler1.ProcessID);
...
public void ProcessID(int i)
{
Console.WriteLine(String.Format("Start Thread ID {0} Int{1}", Thread.CurrentThread.ManagedThreadId, i));
Thread.Sleep(30);
Console.WriteLine("Done Thread ID" + Thread.CurrentThread.ManagedThreadId.ToString(CultureInfo.InvariantCulture));
}
That will give each subscription its own thread, and stay with it.

Trouble with ConcurrentStack<T> in .net c#4

This class is designed to take a list of URLs, scan them, then return a list of those which do not work. It uses multiple threads to avoid taking forever on long lists.
My problem is that even if I replace the actual scanning of URLs with a test function which returns failure on all URLs, the class returns a variable number of failures.
I'm assuming my problem lies either with ConcurrentStack.TryPop() or .Push(), but I can't for the life of me figure out why. They are supposedly thread-safe, and I've tried locking as well, no help there.
Is anyone able to explain to me what I am doing wrong? I don't have a lot of experience with multiple threads.
public class UrlValidator
{
private const int MAX_THREADS = 10;
private List<Thread> threads = new List<Thread>();
private ConcurrentStack<string> errors = new ConcurrentStack<string>();
private ConcurrentStack<string> queue = new ConcurrentStack<string>();
public UrlValidator(List<string> urls)
{
queue.PushRange(urls.ToArray<string>());
}
public List<string> Start()
{
threads = new List<Thread>();
while (threads.Count < MAX_THREADS && queue.Count > 0)
{
var t = new Thread(new ThreadStart(UrlWorker));
threads.Add(t);
t.Start();
}
while (queue.Count > 0) Thread.Sleep(1000);
int runningThreads = 0;
while (runningThreads > 0)
{
runningThreads = 0;
foreach (Thread t in threads) if (t.ThreadState == ThreadState.Running) runningThreads++;
Thread.Sleep(100);
}
return errors.ToList<string>();
}
private void UrlWorker()
{
while (queue.Count > 0)
{
try
{
string url = "";
if (!queue.TryPop(out url)) continue;
if (TestFunc(url) != 200) errors.Push(url);
}
catch
{
break;
}
}
}
private int TestFunc(string url)
{
Thread.Sleep(new Random().Next(100));
return -1;
}
}
This is something that the Task Parallel Library and PLINQ (Parallel LINQ) would be really good at. Check out an example of how much easier things will be if you let .NET do its thing:
public IEnumerable<string> ProcessURLs(IEnumerable<string> URLs)
{
return URLs.AsParallel()
.WithDegreeOfParallelism(10)
.Where(url => testURL(url));
}
private bool testURL(string URL)
{
// some logic to determine true/false
return false;
}
Whenever possible, you should let the libraries .NET provides do any thread management needed. The TPL is great for this in general, but since you're simply filtering a single collection of items, PLINQ is well suited for this. You can modify the degree of parallelism (I would recommend setting it lower than your maximum number of concurrent TCP connections), and you can add multiple conditions just like LINQ allows. The query runs in parallel automatically, and you do no thread management yourself.
Your problem has nothing to do with ConcurrentStack, but rather with the loop where you are checking for running threads:
int runningThreads = 0;
while (runningThreads > 0)
{
...
}
The condition is immediately false, so you never actually wait for threads. In turn, this means that errors will contain errors from whichever threads have run so far.
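If the manual threads are kept, the usual way to wait for them is Thread.Join rather than polling ThreadState. Here is a minimal sketch of that, assuming a hypothetical JoinExample.FindFailures helper that takes the test function as a delegate. It is a simplified stand-in for UrlValidator and not a drop-in replacement for it.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

public static class JoinExample
{
    // Runs `test` over all urls on `threadCount` threads and returns the
    // urls for which it did not return 200. All names are hypothetical.
    public static List<string> FindFailures(
        IEnumerable<string> urls, Func<string, int> test, int threadCount)
    {
        var queue = new ConcurrentQueue<string>(urls);
        var errors = new ConcurrentBag<string>();
        var threads = new List<Thread>();

        for (int i = 0; i < threadCount; i++)
        {
            var worker = new Thread(() =>
            {
                string url;
                // TryDequeue returning false means the queue is empty.
                while (queue.TryDequeue(out url))
                {
                    if (test(url) != 200) errors.Add(url);
                }
            });
            threads.Add(worker);
            worker.Start();
        }

        // Join blocks until the thread has finished, so no polling is needed
        // and every result is in `errors` before it is read.
        foreach (Thread worker in threads) worker.Join();

        return errors.ToList();
    }
}
```

With a test function that always fails, for example url => -1, the returned list should then contain every input URL on every run.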
Your code has other issues as well, and creating threads manually is probably the biggest one. Since you are using .NET 4.0, you should use tasks or PLINQ for asynchronous processing. Using PLINQ, your validation can be implemented as:
public IEnumerable<string> Validate(IEnumerable<string> urls)
{
return urls.AsParallel().Where(url => TestFunc(url) != 200);
}

Is this a good impl for a Producer/Consumer unique keyed buffer?

Can anyone see any problems with this Producer/Consumer unique keyed buffer impl? The idea is that if you add items for processing with the same key, only the latest value will be processed and the old/existing value will be thrown away.
public sealed class PCKeyedBuffer<K,V>
{
private readonly object _locker = new object();
private readonly Thread _worker;
private readonly IDictionary<K, V> _items = new Dictionary<K, V>();
private readonly Action<V> _action;
private volatile bool _shutdown;
public PCKeyedBuffer(Action<V> action)
{
_action = action;
(_worker = new Thread(Consume)).Start();
}
public void Shutdown(bool waitForWorker)
{
_shutdown = true;
if (waitForWorker)
_worker.Join();
}
public void Add(K key, V value)
{
lock (_locker)
{
_items[key] = value;
Monitor.Pulse(_locker);
}
}
private void Consume()
{
while (true)
{
IList<V> values;
lock (_locker)
{
while (_items.Count == 0) Monitor.Wait(_locker);
values = new List<V>(_items.Values);
_items.Clear();
}
foreach (V value in values)
{
_action(value);
}
if(_shutdown) return;
}
}
}
static void Main(string[] args)
{
PCKeyedBuffer<string, double> l = new PCKeyedBuffer<string, double>(delegate(double d)
{
Thread.Sleep(10);
Console.WriteLine(
"Processed: " + d.ToString());
});
for (double i = 0; i < 100; i++)
{
l.Add(i.ToString(), i);
}
for (double i = 0; i < 100; i++)
{
l.Add(i.ToString(), i);
}
for (double i = 0; i < 100; i++)
{
l.Add(i.ToString(), i);
}
Console.WriteLine("Done Enqeueing");
Console.ReadLine();
}
After a quick once-over, I would say that the following code in the Consume method
while (_items.Count == 0) Monitor.Wait(_locker);
should probably Wait using a timeout and check the _shutdown flag on each iteration, especially since you are not setting your consumer thread to be a background thread.
In addition, the Consume method does not appear very scalable, since it single-handedly tries to process an entire queue of items. Of course, this might depend on the rate at which items are being produced. I would probably have the consumer focus on a single item in the list and then use the TPL to run multiple concurrent consumers; this way you can take advantage of multiple cores while letting the TPL balance the workload for you. To reduce the locking required for a consumer processing a single item, you could use a ConcurrentDictionary.
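The timeout suggestion could be sketched like this. TimedWaitBuffer is a hypothetical, cut-down variant of the class from the question, and the 100 ms timeout is an arbitrary value chosen for illustration. The wait loop re-checks the shutdown flag whenever it wakes up, Shutdown pulses the lock so the worker does not have to wait for the timeout, and items that are still buffered get processed before the worker exits.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical cut-down version of the buffer, focused on the wait loop.
public sealed class TimedWaitBuffer<K, V>
{
    private readonly object _locker = new object();
    private readonly Dictionary<K, V> _items = new Dictionary<K, V>();
    private readonly Action<V> _action;
    private readonly Thread _worker;
    private volatile bool _shutdown;

    public TimedWaitBuffer(Action<V> action)
    {
        _action = action;
        // A background thread does not keep the process alive on its own.
        _worker = new Thread(Consume) { IsBackground = true };
        _worker.Start();
    }

    public void Add(K key, V value)
    {
        lock (_locker)
        {
            _items[key] = value;
            Monitor.Pulse(_locker);
        }
    }

    public void Shutdown()
    {
        _shutdown = true;
        lock (_locker) Monitor.Pulse(_locker); // wake the worker right away
        _worker.Join();
    }

    private void Consume()
    {
        while (true)
        {
            List<V> values;
            lock (_locker)
            {
                // Wake up at least every 100 ms to re-check the flag.
                while (_items.Count == 0 && !_shutdown)
                    Monitor.Wait(_locker, 100);
                if (_items.Count == 0) return; // shutdown requested, nothing left
                values = new List<V>(_items.Values);
                _items.Clear();
            }
            foreach (V value in values) _action(value);
        }
    }
}
```

Whether pending items should be drained or dropped on shutdown is a design choice; this sketch drains them.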
As Chris pointed out, ConcurrentDictionary already exists and is more scalable. It was added to the base libraries in .NET 4.0, and is also available as an add-on to .NET 3.5.
This is one of the few attempts at creating a custom producer/consumer that is actually correct, so job well done in that regard. However, as Chris pointed out, your stop flag will be ignored while Monitor.Wait is blocked. There is no need to rehash his suggestion for fixing that. The advice I can offer is to use a BlockingCollection instead of doing the Wait/Pulse calls manually. That would also solve the shutdown problem, since the Take method is cancellable. If you are not using .NET 4.0, then it is available in the Reactive Extensions download that Stephen linked to. If that is not an option, then Stephen Toub has a correct implementation here (except his is not cancellable, but you can always do a Thread.Interrupt to safely unblock it). What you can do is feed KeyValuePair items into the queue instead of using a Dictionary.
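A rough sketch of that BlockingCollection approach follows, assuming a hypothetical BlockingBuffer class. Note one difference from the original: a plain queue of KeyValuePair items processes every added value, so the "only the latest value per key" behaviour is not preserved here and would need extra bookkeeping.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical replacement for the Wait/Pulse code; processes every item added.
public sealed class BlockingBuffer<K, V> : IDisposable
{
    private readonly BlockingCollection<KeyValuePair<K, V>> _items =
        new BlockingCollection<KeyValuePair<K, V>>();
    private readonly CancellationTokenSource _cts = new CancellationTokenSource();
    private readonly Task _worker;

    public BlockingBuffer(Action<V> action)
    {
        _worker = Task.Factory.StartNew(() =>
        {
            try
            {
                // Blocks while the collection is empty; ends after CompleteAdding,
                // or throws OperationCanceledException when the token is cancelled.
                foreach (var pair in _items.GetConsumingEnumerable(_cts.Token))
                    action(pair.Value);
            }
            catch (OperationCanceledException)
            {
                // Immediate shutdown was requested; remaining items are skipped.
            }
        }, TaskCreationOptions.LongRunning);
    }

    public void Add(K key, V value)
    {
        _items.Add(new KeyValuePair<K, V>(key, value));
    }

    // Graceful: process everything that is queued, then stop.
    public void CompleteAndWait()
    {
        _items.CompleteAdding();
        _worker.Wait();
    }

    // Abrupt: stop once the item currently being processed is done.
    public void Cancel()
    {
        _cts.Cancel();
        _worker.Wait();
    }

    public void Dispose()
    {
        _cts.Dispose();
        _items.Dispose();
    }
}
```

The cancellation token replaces both the volatile flag and the timeout: a blocked consumer is woken up as soon as Cancel is called.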
