Using multiple threads to work through a large List? - c#

I have been assigned the task of converting a large list (4 million) of ids to usernames. For this purpose I've decided to delegate multiple tasks to my premium proxies.
public class ProxyWorker
{
private static int _proxyCount;
static void Run(List<long> largeList)
{
var taskList = new List<Task>();
for (int i = 0; i < _proxyCount; i++)
{
taskList.Add(Task.Factory.StartNew(() => ConvertOnProxy(i, largeList.Take(1000).ToList())));
}
Task.WaitAll(taskList.ToArray());
}
static void ConvertOnProxy(int proxyId, List<long> idsToConvert)
{
// TODO
}
}
I'm stuck on how to hand 1,000 ids to each task, remove them from the list once they have been selected so that another thread doesn't take them, and keep everything thread-safe.
I understand that my current code just grabs the first 1,000 items, and that every other task will grab exactly the same ones.

Here's an example of where I would start:
static async Task Test()
{
Queue<int> ids = new Queue<int>(Enumerable.Range(0, 100));
List<Task> tasks = new List<Task>();
for (int i = 0; i < 8; i++)
{
tasks.Add(DoTheThings(ids));
}
await Task.WhenAll(tasks);
}
static async Task DoTheThings(Queue<int> ids)
{
Random rnd = new Random();
int id;
for (;;)
{
lock (ids)
{
if (ids.Count == 0)
{
// All done.
return;
}
id = ids.Dequeue();
}
Debug.WriteLine($"Fetching ID {id}...");
// Simulate variable network delay.
await Task.Delay(rnd.Next(200) + 50);
}
}
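The example above hands out one id at a time. If each proxy should receive 1,000 ids per call, as the question asks, the same idea can be sketched with pre-built batches in a ConcurrentQueue. This is a minimal sketch, not a definitive implementation: BatchedWorker, Split, RunAsync and the convert callback (standing in for ConvertOnProxy) are hypothetical names introduced for illustration.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class BatchedWorker
{
    // Splits the source list into consecutive batches of at most batchSize items.
    public static List<List<long>> Split(List<long> source, int batchSize)
    {
        var batches = new List<List<long>>();
        for (int offset = 0; offset < source.Count; offset += batchSize)
        {
            int count = Math.Min(batchSize, source.Count - offset);
            batches.Add(source.GetRange(offset, count));
        }
        return batches;
    }

    // Starts workerCount workers; each one keeps taking the next unclaimed batch.
    // convert is a hypothetical callback standing in for ConvertOnProxy.
    public static async Task RunAsync(
        List<long> ids, int workerCount, int batchSize,
        Func<int, List<long>, Task> convert)
    {
        var queue = new ConcurrentQueue<List<long>>(Split(ids, batchSize));
        var workers = Enumerable.Range(0, workerCount).Select(async workerId =>
        {
            // TryDequeue is atomic, so no two workers receive the same batch.
            while (queue.TryDequeue(out List<long> batch))
            {
                await convert(workerId, batch);
            }
        }).ToList();
        await Task.WhenAll(workers);
    }
}
```

A call such as `await BatchedWorker.RunAsync(largeList, proxyCount, 1000, ConvertOnProxyAsync)` would then keep every proxy busy until the queue is empty, assuming ConvertOnProxyAsync is an asynchronous version of the method from the question.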

Related

Restart concurrent tasks as soon as they fail for x number of times

I have a console app that is making HTTP queries and adding/updating products in my database according to response. Some fail and need to be retried a few times.
The way I came up with was to use a dictionary to store the product ID and a Task. Then I can check all the task results and re-run.
This is working but it strikes me as inefficient. Tasks are not being re-created until all tasks have finished. It would be more efficient if they were immediately restarted but I can't figure out how to do this. Also every retry involves a query to the database as only the ID is stored.
I made small app that shows how I am currently retrying failed requests.
Can someone suggest a more efficient method for retrying?
class Program
{
private static void Main(string[] args)
{
HttpQuery m = new HttpQuery();
var task = Task.Run(() => m.Start());
Task.WaitAll(task);
Console.WriteLine("Finished");
Console.ReadLine();
}
}
class HttpQuery
{
public async Task Start()
{
// dictionary where key represent reference to something that needs to be processed and bool whether it has completed or not
ConcurrentDictionary<int, Task<bool>> monitor = new ConcurrentDictionary<int, Task<bool>>();
// start async tasks.
Console.WriteLine("starting first try");
for (int i = 0; i < 1000; i++)
{
Console.Write(i+",");
monitor[i] = this.Query(i);
}
// wait for completion
await Task.WhenAll(monitor.Values.ToArray());
Console.WriteLine();
// start retries
// number of retries per query
int retries = 10;
int count = 0;
// check if max retries exceeded or all completed
while (count < retries && monitor.Any(x => x.Value.Result == false))
{
// make list of numbers that failed
List<int> retryList = monitor.Where(x => x.Value.Result == false).Select(x => x.Key).ToList();
Console.WriteLine("starting try number: " + (count+1) + ", Processing: " + retryList.Count);
// create list of tasks to wait for
List<Task<bool>> toWait = new List<Task<bool>>();
foreach (var i in retryList)
{
Console.Write(i + ",");
monitor[i] = this.Query(i);
toWait.Add(monitor[i]);
}
// wait for completion
await Task.WhenAll(toWait.ToArray());
Console.WriteLine();
count++;
}
Console.WriteLine("ended");
Console.ReadLine();
}
public async Task<bool> Query(int i)
{
// simulate a http request that may or may not fail
Random r = new Random();
int delay = i * r.Next(1, 10);
await Task.Delay(delay);
if (r.Next(0,2) == 1)
{
return true;
}
else
{
return false;
}
}
}
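You can create another method that wraps all of this retry logic, and the ugly code goes away :) Each query then retries on its own as soon as it fails, instead of waiting for the whole batch to finish.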
public async Task Start()
{
const int MaxNumberOfTries = 10;
List<Task<bool>> tasks = new List<Task<bool>>();
for (int i = 0; i < 1000; i++)
{
tasks.Add(this.QueryWithRetry(i, MaxNumberOfTries));
}
await Task.WhenAll(tasks);
}
public async Task<bool> QueryWithRetry(int i, int numOfTries)
{
int tries = 0;
bool result;
do
{
result = await Query(i);
tries++;
} while (!result && tries < numOfTries);
return result;
}
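The same wrapper can be made reusable for any operation that reports success as a bool. The following is a minimal sketch under stated assumptions: the Retry class, its RunAsync method and the delayBetweenTries parameter are hypothetical additions for illustration and are not part of the answer above.

```csharp
using System;
using System.Threading.Tasks;

public static class Retry
{
    // Runs operation until it returns true or maxTries attempts have been made.
    // Unlike the do/while version, maxTries <= 0 means the operation is never run.
    // delayBetweenTries is an assumed extra; pass TimeSpan.Zero to retry immediately.
    public static async Task<bool> RunAsync(
        Func<Task<bool>> operation, int maxTries, TimeSpan delayBetweenTries)
    {
        for (int attempt = 1; attempt <= maxTries; attempt++)
        {
            if (await operation())
            {
                return true;
            }
            // Pause only when another attempt will follow.
            if (attempt < maxTries && delayBetweenTries > TimeSpan.Zero)
            {
                await Task.Delay(delayBetweenTries);
            }
        }
        return false;
    }
}
```

With this helper, the loop in Start could add `Retry.RunAsync(() => Query(i), MaxNumberOfTries, TimeSpan.Zero)` to the task list, provided the loop variable is copied to a local before it is captured by the lambda.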

Async Task.Run Not Working

I wrote the code below and expected it to create 3 text files asynchronously in C#, but I don't see any files:
private async void Form1_Load(object sender, EventArgs e)
{
Task<int> file1 = test();
Task<int> file2 = test();
Task<int> file3 = test();
int output1 = await file1;
int output2 = await file2;
int output3 = await file3;
}
async Task<int> test()
{
return await Task.Run(() =>
{
string content = "";
for (int i = 0; i < 100000; i++)
{
content += i.ToString();
}
System.IO.File.WriteAllText(string.Format(@"c:\test\{0}.txt", new Random().Next(1, 5000)), content);
return 1;
});
}
There are a few potential issues:
Does c:\test\ exist? If not, you'll get an error.
As written, your Random objects might generate the same numbers, since the current system time is used as the seed, and you are doing these at about the same time. You can fix this by making them share a static Random instance. Edit: but you need to synchronize the access to it somehow. I chose a simple lock on the Random instance, which isn't the fastest, but works for this example.
Building a long string that way is very inefficient (e.g. about 43 seconds in Debug mode for me, to do it once). Your tasks might be working just fine, and you don't notice that it's actually doing anything because it takes so long to finish. It can be made much faster by using the StringBuilder class (e.g. about 20 ms).
(this won't affect whether or not it works, but is more of a stylistic thing) you don't need to use the async and await keywords in your test() method as written. They are redundant, since Task.Run already returns a Task<int>.
This works for me:
private async void Form1_Load(object sender, EventArgs e)
{
Task<int> file1 = test();
Task<int> file2 = test();
Task<int> file3 = test();
int output1 = await file1;
int output2 = await file2;
int output3 = await file3;
}
static Random r = new Random();
Task<int> test()
{
return Task.Run(() =>
{
var content = new StringBuilder();
for (int i = 0; i < 100000; i++)
{
content.Append(i);
}
int n;
lock (r) n = r.Next(1, 5000);
System.IO.File.WriteAllText(string.Format(@"c:\test\{0}.txt", n), content.ToString());
return 1;
});
}
Creating a new Random instance for each call can make every call produce the same number.
Random number generation starts from a seed value. If the same seed is used repeatedly, the same series of numbers is generated.
On .NET Framework the parameterless Random constructor seeds from the system clock, and the clock's resolution is too coarse to tell apart instances created in quick succession.
Use the same Random number generator, example:
internal async Task<int> test()
{
return await Task.Run(() =>
{
string content = "";
for (int i = 0; i < 10000; i++)
{
content += i.ToString();
}
System.IO.File.WriteAllText(string.Format(@"c:\test\{0}.txt", MyRandom.Next(1, 5000)), content);
return 1;
});
}
EDIT:
Also Random is not thread safe so you should synchronize access to it:
public static class MyRandom
{
private static Random random = new Random();
public static int Next(int start, int end)
{
lock (random)
{
return random.Next(start,end);
}
}
}
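The seed behaviour described above can be checked directly by passing the same explicit seed twice. This is a small sketch for illustration only; SeedDemo and Sequence are hypothetical names, and the seed value in the usage note is arbitrary.

```csharp
using System;

public static class SeedDemo
{
    // Returns the first `count` values in [1, 5000) produced by a Random
    // that was created with the given seed.
    public static int[] Sequence(int seed, int count)
    {
        var random = new Random(seed);
        var values = new int[count];
        for (int i = 0; i < count; i++)
        {
            values[i] = random.Next(1, 5000);
        }
        return values;
    }
}
```

Calling `SeedDemo.Sequence(42, 5)` twice returns the same five numbers both times, which is the situation several Random instances end up in when they receive the same time-based seed.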

How wait all thread

I have code that creates 5 threads. I need to wait until all the threads have finished their work, and then return a value. How can I do this?
public static int num=-1;
public int GetValue()
{
Thread t=null;
for (int i = 0; i <=5; i++)
{
t = new Thread(() => PasswdThread(i));
t.Start();
}
//how wait all thread, and than return value?
return num;
}
public void PasswdThread(int i)
{
Thread.Sleep(1000);
Random r=new Random();
int n=r.Next(10);
if (n==5)
{
num=r.Next(1000);
}
}
Of course this is not the real code. The actual code is much more complicated, so I simplified it.
P.S. Look carefully: I am not using Task, so I can't use Wait() or WaitAll(). I also can't use Join(), because Join waits for a single thread. If it starts waiting on a thread that has already finished its work, it will wait forever.
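Keep the threads in a list and call Join on each one. Join returns immediately for a thread that has already finished, so waiting on every thread in turn is safe:
List<Thread> threads = new List<Thread>();
for (int i = 0; i <= 5; i++)
{
int id = i; // copy, so each lambda captures its own value
Thread t = new Thread(() => PasswdThread(id));
t.Start();
threads.Add(t);
}
// Thread has no WaitAll method; Join blocks until the given thread ends.
foreach (Thread t in threads)
{
t.Join();
}
return num;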
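Create a ManualResetEvent for each thread, and then call WaitHandle.WaitAll(handles) in your main thread. Create the events before starting the threads, so that WaitAll never sees an empty slot.
static ManualResetEvent[] handles = new ManualResetEvent[6]; // one per thread, i = 0..5
// In GetValue, before starting thread i: handles[i] = new ManualResetEvent(false);
// After the loop: WaitHandle.WaitAll(handles);
public void PasswdThread(int i)
{
Thread.Sleep(1000);
Random r=new Random();
int n=r.Next(10);
if (n==5)
{
num=r.Next(1000);
}
handles[i].Set(); // signal that this thread is done
}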
Get more information on http://msdn.microsoft.com/en-us/library/z6w25xa6.aspx
I think you can use WaitHandle.WaitAll(handles), or otherwise poll with Thread.Sleep(100) until the threads are done.
In Thread.Sleep, 100 is the number of milliseconds, so the calling thread sleeps for 100 milliseconds.
Note that Thread itself has no WaitAll method: WaitHandle.WaitAll takes an array of wait handles, not threads. To wait for the threads directly, call Join on each one.
As this question is effectively a duplicate, please see this answer (code copied below, all credit to Reed Copsey).
class Program
{
static void Main(string[] args)
{
int numThreads = 10;
ManualResetEvent resetEvent = new ManualResetEvent(false);
int toProcess = numThreads;
// Start workers.
for (int i = 0; i < numThreads; i++)
{
new Thread(delegate()
{
Console.WriteLine(Thread.CurrentThread.ManagedThreadId);
// If we're the last thread, signal
if (Interlocked.Decrement(ref toProcess) == 0)
resetEvent.Set();
}).Start();
}
// Wait for workers.
resetEvent.WaitOne();
Console.WriteLine("Finished.");
}
}
Aside
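Also note that your PasswdThread code may not produce distinct random numbers: every thread creates its own Random at about the same time, so the instances can end up with the same seed. Declare a single static Random outside the method instead, and lock around it, because Random is not thread-safe.
Additionally, you never use the int i parameter of that method.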
I would use TPL for this, imo it's the most up to date technique for handling this sort of synchronization. Given the real life code is probably more complex, I'll rework the example slightly:
public int GetValue()
{
List<Task<int>> tasks = new List<Task<int>>();
for (int i = 0; i <=5; i++)
{
tasks.Add(PasswdThread(i));
}
Task.WaitAll(tasks.ToArray());
// You can now query all the tasks:
foreach (int result in tasks.Select(t => t.Result))
{
if (result == 100) // Do something to pick the desired result...
{
return result;
}
}
return -1;
}
public Task<int> PasswdThread(int i)
{
return Task.Factory.StartNew(() => {
Thread.Sleep(1000);
Random r=new Random();
int n=r.Next(10);
if (n==5)
{
return r.Next(1000);
}
return 0;
});
}
Thread t=null;
List<Thread> lst = new List<Thread>();
for (int i = 0; i <=5; i++)
{
int id = i; // copy, so each lambda captures its own value
t = new Thread(() => PasswdThread(id));
lst.Add(t);
t.Start();
}
//how wait all thread, and than return value?
foreach(var item in lst)
{
while(item.IsAlive)
{
Thread.Sleep(5);
}
}
return num;

.NET 2.0 Processing very large lists using ThreadPool

This is further to my question here
By doing some reading .... I moved away from Semaphores to ThreadPool.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading;
namespace ThreadPoolTest
{
class Data
{
public int Pos { get; set; }
public int Num { get; set; }
}
class Program
{
static ManualResetEvent[] resetEvents = new ManualResetEvent[20];
static void Main(string[] args)
{
int s = 0;
for (int i = 0; i < 100000; i++)
{
resetEvents[s] = new ManualResetEvent(false);
Data d = new Data();
d.Pos = s;
d.Num = i;
ThreadPool.QueueUserWorkItem(new WaitCallback(Process), (object)d);
if (s >= 19)
{
WaitHandle.WaitAll(resetEvents);
Console.WriteLine("Press Enter to Move forward");
Console.ReadLine();
s = 0;
}
else
{
s = s + 1;
}
}
}
private static void Process(object o)
{
Data d = (Data) o;
Console.WriteLine(d.Num.ToString());
Thread.Sleep(10000);
resetEvents[d.Pos].Set();
}
}
}
This code works and I am able to process in sets of 20. But I don't like it because of WaitAll. Say I start a batch of 20, and 3 threads take a long time while the other 17 have finished. Those 17 slots then sit idle until WaitAll returns.
WaitAny would have been good... but it seems rather messy that I would have to build so many control structures (Stacks, Lists, Queues etc.) in order to use the pool efficiently.
The other thing I don't like is the global resetEvents variable in the class, which is only there because the array has to be shared between the Process method and the main loop.
The above code works... but I need your help in improving it.
Again... I am on .NET 2.0 VS 2008. I cannot use .NET 4.0 parallel/async framework.
There are several ways you can do this. Probably the easiest, based on what you've posted above, would be:
const int MaxThreads = 4;
const int ItemsToProcess = 10000;
private Semaphore _sem = new Semaphore(MaxThreads, MaxThreads);
void DoTheWork()
{
int s = 0;
for (int i = 0; i < ItemsToProcess; ++i)
{
_sem.WaitOne();
Data d = new Data();
d.Pos = s;
d.Num = i;
ThreadPool.QueueUserWorkItem(Process, d);
++s;
if (s >= 19)
s = 0;
}
// All items have been assigned threads.
// Now, acquire the semaphore "MaxThreads" times.
// When counter reaches that number, we know all threads are done.
int semCount = 0;
while (semCount < MaxThreads)
{
_sem.WaitOne();
++semCount;
}
// All items are processed
// Clear the semaphore for next time.
_sem.Release(semCount);
}
void Process(object o)
{
// do the processing ...
// release the semaphore
_sem.Release();
}
I only used four threads in my example because that's how many cores I have. It makes little sense to be using 20 threads when only four of them can be processing at any one time. But you're free to increase the MaxThreads number if you like.
So I'm pretty sure this is all .NET 2.0.
We'll start out defining Action, because I'm so used to using it. If using this solution in 3.5+, remove that definition.
Next, we create a queue of actions based on the input.
After that we define a callback; this callback is the meat of the method.
It first grabs the next item in the queue (using a lock since the queue isn't thread safe). If it ended up having an item to grab it executes that item. Next it adds a new item to the thread pool which is "itself". This is a recursive anonymous method (you don't come across uses of that all that often). This means that when the callback is called for the first time it will execute one item, then schedule a task which will execute another item, and that item will schedule a task that executes another item, and so on. Eventually the queue will run out, and they'll stop queuing more items.
We also want the method to block until we're all done, so for that we keep track of how many of these callbacks have finished through incrementing a counter. When that counter reaches the task limit we signal the event.
Finally we start N of these callbacks in the thread pool.
public delegate void Action();
public static void Execute(IEnumerable<Action> actions, int maxConcurrentItems)
{
object key = new object();
Queue<Action> queue = new Queue<Action>(actions);
int count = 0;
AutoResetEvent whenDone = new AutoResetEvent(false);
WaitCallback callback = null;
callback = delegate
{
Action action = null;
lock (key)
{
if (queue.Count > 0)
action = queue.Dequeue();
}
if (action != null)
{
action();
ThreadPool.QueueUserWorkItem(callback);
}
else
{
if (Interlocked.Increment(ref count) == maxConcurrentItems)
whenDone.Set();
}
};
for (int i = 0; i < maxConcurrentItems; i++)
{
ThreadPool.QueueUserWorkItem(callback);
}
whenDone.WaitOne();
}
Here's another option that doesn't use the thread pool, and just uses a fixed number of threads:
public static void Execute(IEnumerable<Action> actions, int maxConcurrentItems)
{
Thread[] threads = new Thread[maxConcurrentItems];
object key = new object();
Queue<Action> queue = new Queue<Action>(actions);
for (int i = 0; i < maxConcurrentItems; i++)
{
threads[i] = new Thread(new ThreadStart(delegate
{
Action action = null;
do
{
lock (key)
{
if (queue.Count > 0)
action = queue.Dequeue();
else
action = null;
}
if (action != null)
{
action();
}
} while (action != null);
}));
threads[i].Start();
}
for (int i = 0; i < maxConcurrentItems; i++)
{
threads[i].Join();
}
}

Threaded simultaneous jobs

There is a string array myDownloadList containing 100 string URIs. I want to start 5 thread jobs that will pop next URI from myDownloadList (like a stack) and do something with it (download it), until there is no URIs left on a stack (myDownloadList).
What would be the best practice to do this?
Use the ThreadPool, and just setup all of your requests. The ThreadPool will automatically schedule them appropriately.
This will get easier with .NET 4, using the Task Parallel Library. Setting up each request as a Task is very efficient and easy.
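As a rough illustration of the Task Parallel Library route, Parallel.ForEach with a MaxDegreeOfParallelism of 5 caps how many URIs are handled at once. This is only a sketch under stated assumptions: DownloadRunner, Run and the process callback (standing in for the real download code) are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class DownloadRunner
{
    // Processes every URI, with at most maxParallel of them running at once.
    // process is a hypothetical stand-in for the real download routine.
    public static void Run(IEnumerable<string> uris, int maxParallel, Action<string> process)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallel };
        // Parallel.ForEach hands each item to exactly one worker and
        // blocks until all items have been processed.
        Parallel.ForEach(uris, options, process);
    }
}
```

A call such as `DownloadRunner.Run(myDownloadList, 5, Download)` replaces the manual stack handling, assuming Download is a method that takes a single URI string.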
Make sure each thread locks myDownloadList when accessing it. You could use recursion to keep taking the next item; when the list is empty, the function can just stop.
See the example below.
public static List<string> MyList { get; set; }
public static object LockObject { get; set; }
static void Main(string[] args)
{
Console.Clear();
Program.LockObject = new object();
// Create the list
Program.MyList = new List<string>();
// Add 100 items to it
for (int i = 0; i < 100; i++)
{
Program.MyList.Add(string.Format("Item Number = {0}", i));
}
// Start Threads
for (int i = 0; i < 5; i++)
{
Thread thread = new Thread(new ThreadStart(Program.PopItemFromStackAndPrint));
thread.Name = string.Format("Thread # {0}", i);
thread.Start();
}
}
public static void PopItemFromStackAndPrint()
{
string item = string.Empty;
lock (Program.LockObject)
{
// Check inside the lock, so another thread cannot empty the list
// between the check and the removal.
if (Program.MyList.Count == 0)
{
return;
}
// Get first Item
item = Program.MyList[0];
Program.MyList.RemoveAt(0);
}
Console.WriteLine("{0}:{1}", System.Threading.Thread.CurrentThread.Name, item);
// Sleep to show other processing for examples only
System.Threading.Thread.Sleep(10);
Program.PopItemFromStackAndPrint();
}
