Below is my current code, which fetches 500 documents (JSON format) from DocumentDB per call. The search is limited to 500 documents per request, and each batch is added to a concurrent bag (in parallel). The data fetched depends on the id I pass to the API, which returns the documents in that range. E.g. id = 500 gets documents 501 - 1000. The code below fills the concurrent bag with 25k documents as expected.
int threadNumber = 5;
var concurrentBag = new ConcurrentBag<docClass>();
if (batch == 25000)
{
    id = 500;
    while (id <= 25000)
    {
        docs = await client.SearchDocuments<docClass>(GetFollowUpRequest(id), requestOptions);
        docClass lastdoc = docs.Documents.Last();
        lastid = lastdoc.Id.Id;
        Parallel.ForEach(docs.Documents, new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount }, item =>
        {
            concurrentBag.Add(item);
        });
        id = id + 500;
    }
}
I wanted to run this whole while loop on multiple threads so that I can make several API calls and fetch 500 documents in parallel. I modified the code as below, but after the whole run I only ever see 500 documents in 'concurrentBag', and the skip id stays at 500 and doesn't increment.
int threadNumber = 5;
var concurrentBag = new ConcurrentBag<docClass>();
if (batch == 25000)
{
    id = 500;
    Task[] tasks = new Task[threadNumber];
    for (int j = 0; j < threadNumber; j++)
    {
        tasks[j] = Task.Run(async () =>
        {
            while (id <= 25000)
            {
                docs = await client.SearchDocuments<docClass>(GetFollowUpRequest(id), requestOptions);
                docClass lastdoc = docs.Documents.Last();
                lastid = lastdoc.Id.Id;
                Parallel.ForEach(docs.Documents, new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount }, item =>
                {
                    concurrentBag.Add(item);
                });
                id = id + 500;
            }
        });
    }
}
Can you please help me understand what I am doing wrong here?
To answer the direct question first: all five of your Task.Run lambdas capture the same id variable, and id = id + 500 is not an atomic operation, so the loops race on id (and on the shared docs and lastid variables). That is why the counter appears stuck and the bag ends up with unpredictable contents.
For loading documents from an external resource, use an asynchronous approach, without extra threads.
Note that when you download external resources in parallel on threads, those extra threads do no work; they just sit waiting for the response, so the threads are simply wasted ;)
The asynchronous approach makes it possible to launch multiple requests almost simultaneously without waiting for each task to complete; you wait only once, when all the tasks are ready.
var maxDocuments = 25000;
var step = 500;
// Start all requests without awaiting them individually.
var documentTasks = Enumerable.Range(1, int.MaxValue)
    .Select(offset => step * offset)
    .TakeWhile(id => id <= maxDocuments)
    .Select(id => client.SearchDocuments<docClass>(GetFollowUpRequest(id), requestOptions))
    .ToArray();
// Wait once, for all of them.
await Task.WhenAll(documentTasks);
var allDocuments = documentTasks
    .Select(task => task.Result)
    .SelectMany(response => response.Documents)
    .ToArray();
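If firing all the requests at once is too aggressive for the backend, the same single-await pattern can be throttled with a SemaphoreSlim. The sketch below is self-contained: the real SearchDocuments call is replaced with a simulated page fetch, so the names and delays are illustrative only, not the question's actual API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    // Stand-in for the question's SearchDocuments call: returns one "page"
    // of ids covering the range (id - step, id]. Purely simulated with a delay.
    static async Task<List<int>> FetchPageAsync(int id, int step)
    {
        await Task.Delay(50); // simulate network latency
        return Enumerable.Range(id - step + 1, step).ToList();
    }

    static async Task Main()
    {
        const int maxDocuments = 2500;
        const int step = 500;
        var throttle = new SemaphoreSlim(2); // at most 2 requests in flight

        var pageTasks = Enumerable.Range(1, maxDocuments / step)
            .Select(offset => step * offset)
            .Select(async id =>
            {
                await throttle.WaitAsync(); // each task waits for a free slot
                try { return await FetchPageAsync(id, step); }
                finally { throttle.Release(); }
            })
            .ToArray();

        var pages = await Task.WhenAll(pageTasks);
        var allDocuments = pages.SelectMany(p => p).ToArray();
        Console.WriteLine(allDocuments.Length); // 2500
    }
}
```

The throttle limits how many requests are in flight at once while keeping the single final await; no extra threads are created at any point.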
Related
I would like to call my API in parallel x number of times so processing can be done quickly.
I have three methods below that call APIs in parallel. I am trying to understand which is the best way to perform this action.
Base Code
var client = new System.Net.Http.HttpClient();
client.DefaultRequestHeaders.Add("Accept", "application/json");
client.BaseAddress = new Uri("https://jsonplaceholder.typicode.com");
var list = new List<int>();
var listResults = new List<string>();
for (int i = 1; i < 5; i++)
{
list.Add(i);
}
1st Method using Parallel.ForEach
Parallel.ForEach(list,new ParallelOptions() { MaxDegreeOfParallelism = 3 }, index =>
{
var response = client.GetAsync("posts/" + index).Result;
var contents = response.Content.ReadAsStringAsync().Result;
listResults.Add(contents);
Console.WriteLine(contents);
});
Console.WriteLine("After all parallel tasks are done with Parallel for each");
2nd Method with Tasks. I am not sure if this runs in parallel. Let me know if it does.
var loadPosts = new List<Task<string>>();
foreach(var post in list)
{
var response = await client.GetAsync("posts/" + post);
var contents = response.Content.ReadAsStringAsync();
loadPosts.Add(contents);
Console.WriteLine(contents.Result);
}
await Task.WhenAll(loadPosts);
Console.WriteLine("After all parallel tasks are done with Task When All");
3rd Method using ActionBlock - This is what I believe I should always do, but I want to hear from the community
var responses = new List<string>();
var block = new ActionBlock<int>(
async x => {
var response = await client.GetAsync("posts/" + x);
var contents = await response.Content.ReadAsStringAsync();
Console.WriteLine(contents);
responses.Add(contents);
},
new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = 6, // Parallelize on all cores
});
for (int i = 1; i < 5; i++)
{
block.Post(i);
}
block.Complete();
await block.Completion;
Console.WriteLine("After all parallel tasks are done with Action block");
Approach number 2 is close. Here's a rule of thumb: I/O-bound operations => use Tasks/WhenAll (asynchrony); compute-bound operations => use parallelism. HTTP requests are network I/O.
var tasks = new List<Task<string>>();
foreach (var post in list)
{
    async Task<string> func()
    {
        var response = await client.GetAsync("posts/" + post);
        return await response.Content.ReadAsStringAsync();
    }
    tasks.Add(func());
}
await Task.WhenAll(tasks);
var postResponses = new List<string>();
foreach (var t in tasks)
{
    var postResponse = await t; // t.Result would be okay too.
    postResponses.Add(postResponse);
    Console.WriteLine(postResponse);
}
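Since Task.WhenAll over a collection of Task&lt;string&gt; returns the results directly as an array, the collection loop can be folded into the single await. A self-contained sketch (the real HTTP call is replaced with a simulated fetch, so the names are illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

class Program
{
    // Simulated stand-in for client.GetAsync + ReadAsStringAsync.
    static async Task<string> FetchPostAsync(int id)
    {
        await Task.Delay(10);
        return $"post {id}";
    }

    static async Task Main()
    {
        var list = new List<int> { 1, 2, 3, 4 };
        var tasks = list.Select(FetchPostAsync).ToList();

        // WhenAll both awaits completion and gathers the results, in request order.
        string[] postResponses = await Task.WhenAll(tasks);
        Console.WriteLine(string.Join(", ", postResponses)); // post 1, post 2, post 3, post 4
    }
}
```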
I made a little console app to test all the methods by pinging the API "https://jsonplaceholder.typicode.com/todos/{i}" 200 times.
@MikeLimaSierra Methods 1 and 3 were the fastest!
Method                        DegreeOfParallelism   Time
Not Parallel                  n/a                   8.4 sec
@LearnAspNet (OP) Method 1    2                     5.494 sec
@LearnAspNet (OP) Method 1    30                    1.235 sec
@LearnAspNet (OP) Method 3    2                     4.750 sec
@LearnAspNet (OP) Method 3    30                    1.795 sec
@jamespconnor Method          n/a                   21.5 sec
@YuliBonner Method            n/a                   21.4 sec
I would use the following. It has no control of concurrency (it dispatches all HTTP requests in parallel, unlike your 3rd method), but it is a lot simpler - it only has a single await.
var client = new HttpClient();
var list = new[] { 1, 2, 3, 4, 5 };
var postTasks = list.Select(p => client.GetStringAsync("posts/" + p));
var posts = await Task.WhenAll(postTasks);
foreach (var postContent in posts)
{
Console.WriteLine(postContent);
}
I want to limit the total number of queries that I submit to my database server across all Dataflow blocks to 30. In the following scenario, the throttling of 30 concurrent tasks is per block so it always hits 60 concurrent tasks during execution. Obviously I could limit my parallelism to 15 per block to achieve a system wide total of 30 but this wouldn't be optimal.
How do I make this work? Do I limit (and block) my awaits using SemaphoreSlim, etc, or is there an intrinsic Dataflow approach that works better?
public class TPLTest
{
private long AsyncCount = 0;
private long MaxAsyncCount = 0;
private long TaskId = 0;
private object MetricsLock = new object();
public async Task Start()
{
ExecutionDataflowBlockOptions execOption
= new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 30 };
DataflowLinkOptions linkOption = new DataflowLinkOptions()
{ PropagateCompletion = true };
var doFirstIOWorkAsync = new TransformBlock<Data, Data>(
async data => await DoIOBoundWorkAsync(data), execOption);
var doCPUWork = new TransformBlock<Data, Data>(
data => DoCPUBoundWork(data));
var doSecondIOWorkAsync = new TransformBlock<Data, Data>(
async data => await DoIOBoundWorkAsync(data), execOption);
var doProcess = new TransformBlock<Data, string>(
i => $"Task finished, ID = : {i.TaskId}");
var doPrint = new ActionBlock<string>(
s => Debug.WriteLine(s));
doFirstIOWorkAsync.LinkTo(doCPUWork, linkOption);
doCPUWork.LinkTo(doSecondIOWorkAsync, linkOption);
doSecondIOWorkAsync.LinkTo(doProcess, linkOption);
doProcess.LinkTo(doPrint, linkOption);
int taskCount = 150;
for (int i = 0; i < taskCount; i++)
{
await doFirstIOWorkAsync.SendAsync(new Data() { Delay = 2500 });
}
doFirstIOWorkAsync.Complete();
await doPrint.Completion;
Debug.WriteLine("Max concurrent tasks: " + MaxAsyncCount.ToString());
}
private async Task<Data> DoIOBoundWorkAsync(Data data)
{
lock(MetricsLock)
{
AsyncCount++;
if (AsyncCount > MaxAsyncCount)
MaxAsyncCount = AsyncCount;
}
if (data.TaskId <= 0)
data.TaskId = Interlocked.Increment(ref TaskId);
await Task.Delay(data.Delay);
lock (MetricsLock)
AsyncCount--;
return data;
}
private Data DoCPUBoundWork(Data data)
{
data.Step = 1;
return data;
}
}
Data Class:
public class Data
{
public int Delay { get; set; }
public long TaskId { get; set; }
public int Step { get; set; }
}
Starting point:
TPLTest tpl = new TPLTest();
await tpl.Start();
Why don't you marshal everything to an action block that has the actual limitation?
var count = 0;
var ab1 = new TransformBlock<int, string>(l => $"1:{l}");
var ab2 = new TransformBlock<int, string>(l => $"2:{l}");
var doPrint = new ActionBlock<string>(
async s =>
{
var c = Interlocked.Increment(ref count);
Console.WriteLine($"{c}:{s}");
await Task.Delay(5);
Interlocked.Decrement(ref count);
},
new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 15 });
ab1.LinkTo(doPrint);
ab2.LinkTo(doPrint);
for (var i = 100; i > 0; i--)
{
if (i % 3 == 0) await ab1.SendAsync(i);
if (i % 5 == 0) await ab2.SendAsync(i);
}
ab1.Complete();
ab2.Complete();
await ab1.Completion;
await ab2.Completion;
This is the solution I ended up going with (unless I can figure out how to use a single generic DataFlow block for marshalling every type of database access):
I defined a SemaphoreSlim at the class level:
private SemaphoreSlim ThrottleDatabaseQuerySemaphore = new SemaphoreSlim(30, 30);
I modified the I/O class to call a throttling class:
private async Task<Data> DoIOBoundWorkAsync(Data data)
{
    if (data.TaskId <= 0)
        data.TaskId = Interlocked.Increment(ref TaskId);
    Task t = Task.Delay(data.Delay);
    await ThrottleDatabaseQueryAsync(t);
    return data;
}
The throttling routine (I also have a generic version, because I couldn't figure out how to write one routine to handle both Task and Task<TResult>):
private async Task ThrottleDatabaseQueryAsync(Task task)
{
    await ThrottleDatabaseQuerySemaphore.WaitAsync();
    try
    {
        lock (MetricsLock)
        {
            AsyncCount++;
            if (AsyncCount > MaxAsyncCount)
                MaxAsyncCount = AsyncCount;
        }
        await task;
    }
    finally
    {
        ThrottleDatabaseQuerySemaphore.Release();
        lock (MetricsLock)
            AsyncCount--;
    }
}
}
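One wrinkle worth noting: the routine above receives a Task that has already started (Task.Delay is running before WaitAsync is ever reached), so the semaphore delays the await, not the start of the query. A sketch of the generic version alluded to, which sidesteps both issues by accepting a factory and letting the non-generic overload delegate to it (hypothetical names, not the author's actual code):

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class Throttler
{
    private readonly SemaphoreSlim _semaphore;

    public Throttler(int limit) => _semaphore = new SemaphoreSlim(limit, limit);

    // Generic version: the factory is invoked only after a permit is acquired,
    // so the throttled work does not start early.
    public async Task<TResult> RunAsync<TResult>(Func<Task<TResult>> taskFactory)
    {
        await _semaphore.WaitAsync();
        try { return await taskFactory(); }
        finally { _semaphore.Release(); }
    }

    // Non-generic overload delegates to the generic one with a dummy result.
    public Task RunAsync(Func<Task> taskFactory) =>
        RunAsync(async () => { await taskFactory(); return true; });
}

class Program
{
    static async Task Main()
    {
        var throttler = new Throttler(limit: 3);
        int running = 0, maxRunning = 0;
        var gate = new object();

        var tasks = Enumerable.Range(0, 10).Select(_ => throttler.RunAsync(async () =>
        {
            lock (gate) { running++; maxRunning = Math.Max(maxRunning, running); }
            await Task.Delay(50); // simulated database query
            lock (gate) running--;
        }));

        await Task.WhenAll(tasks);
        Console.WriteLine(maxRunning <= 3); // True
    }
}
```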
The simplest solution to this problem is to configure all your blocks with a limited-concurrency TaskScheduler:
TaskScheduler scheduler = new ConcurrentExclusiveSchedulerPair(
TaskScheduler.Default, maxConcurrencyLevel: 30).ConcurrentScheduler;
ExecutionDataflowBlockOptions execOption = new()
{
TaskScheduler = scheduler,
MaxDegreeOfParallelism = scheduler.MaximumConcurrencyLevel,
};
TaskSchedulers can only limit the concurrency of work done on threads. They can't throttle asynchronous operations that are not running on threads. So in order to enforce the MaximumConcurrencyLevel policy, unfortunately you must pass synchronous delegates to all the Dataflow blocks. For example:
TransformBlock<Data, Data> doFirstIOWorkAsync = new(data =>
{
return DoIOBoundWorkAsync(data).GetAwaiter().GetResult();
}, execOption);
This change will increase the demand for ThreadPool threads, so you'd better increase the number of threads that the ThreadPool creates instantly on demand to a higher value than the default Environment.ProcessorCount:
ThreadPool.SetMinThreads(100, 100); // At the start of the program
I am proposing this solution not because it is optimal, but because it is easy to implement. My understanding is that wasting some RAM on ~30 threads that are going to be blocked most of the time, won't have any measurable negative effect on the type of application that you are working with.
I found an osrm-machine, and it returns a JSON string when I send a request. The JSON has some specific information about 2 locations; I am processing the JSON and reading the value of the distance property to build a distance matrix of these locations. I have more than 2000 locations, and this processing takes approximately 4 hours. I need to decrease the execution time with parallelism, but I am very new to the topic. Here is my work; what should I do to optimize the parallel loop? Or maybe you can steer me toward a new approach. Thanks
var client = new RestClient("http://127.0.0.1:5000/route/v1/table/");
var watch = System.Diagnostics.Stopwatch.StartNew();
//rowCount = 2500
Parallel.For(0, rowCount, i =>
{
Parallel.For(0, rowCount, j =>
{
//request the server with specific lats/longs
var request = new RestRequest(String.Format("{0},{1};{2},{3}", le.LocationList[i].longitude, le.LocationList[i].latitude,
le.LocationList[j].longitude, le.LocationList[j].latitude));
//read the response and deserialize it into an object
var response = client.Execute<RootObject>(request);
//define objem with the List of attributes in routes
var objem = response.Data.routes;
//read all distances and durations in each response and store them in the dist and dur matrices
Parallel.ForEach(objem, (o) =>
{
dist[i, j] = o.distance;
dur[i, j] = o.duration;
threads[i,j] = Thread.CurrentThread.ManagedThreadId;
Thread.Sleep(10);
});
});
});
watch.Stop();
var elapsedMs = watch.ElapsedMilliseconds;
I cleaned it up a bit. One use of parallelism is enough. There still is one loop that you need to look at, because it's overwriting data and I do not know what to do with it; that's your call.
You need to experiment with the value of the maxThreads variable a bit. Normally, .NET spins up just enough threads for your processor to handle; in your case you can spin up more, because you know that all of them are just idling, waiting on the network stack.
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
namespace ConsoleApp7
{
class Program
{
static void Process(int i, int j)
{
var client = new RestClient("http://127.0.0.1:5000/route/v1/table/");
//request the server with specific lats/longs
var request = new RestRequest(String.Format("{0},{1};{2},{3}", le.LocationList[i].longitude, le.LocationList[i].latitude, le.LocationList[j].longitude, le.LocationList[j].latitude));
//read the response and deserialize it into an object
var response = client.Execute<RootObject>(request);
//define objem with the List of attributes in routes
var objem = response.Data.routes;
//read all distances and durations in each response and store them in the dist and dur matrices
// !!!
// !!! THIS LOOP NEEDS TO GO.
// !!! IT MAKES NO SENSE!
// !!! YOU ARE OVERWRITING YOUR OWN DATA!
// !!!
Parallel.ForEach(objem, (o) =>
{
dist[i, j] = o.distance;
dur[i, j] = o.duration;
threads[i, j] = Thread.CurrentThread.ManagedThreadId;
Thread.Sleep(10);
});
}
static void Main(string[] args)
{
var watch = System.Diagnostics.Stopwatch.StartNew();
var rowCount = 2500;
var maxThreads = 100;
var allPairs = Enumerable.Range(0, rowCount).SelectMany(x => Enumerable.Range(0, rowCount).Select(y => new { X = x, Y = y }));
Parallel.ForEach(allPairs, new ParallelOptions { MaxDegreeOfParallelism = maxThreads }, pair => Process(pair.X, pair.Y));
watch.Stop();
var elapsedMs = watch.ElapsedMilliseconds;
}
}
}
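Since these threads spend nearly all their time idle waiting on the network (as noted above), an asynchronous variant with a SemaphoreSlim cap avoids blocking threads altogether. This is a self-contained sketch: the RestSharp call is replaced by a simulated fetch, and all names are illustrative.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    // Simulated stand-in for the OSRM route request; real code would issue
    // an async HTTP call here instead.
    static async Task<(double distance, double duration)> FetchRouteAsync(int i, int j)
    {
        await Task.Delay(5); // network latency
        return (i + j, (i + j) * 2.0);
    }

    static async Task Main()
    {
        const int rowCount = 20;
        var dist = new double[rowCount, rowCount];
        var dur = new double[rowCount, rowCount];
        var throttle = new SemaphoreSlim(100); // plays the role of maxThreads above

        async Task FillCellAsync(int i, int j)
        {
            await throttle.WaitAsync();
            try
            {
                var (d, t) = await FetchRouteAsync(i, j);
                dist[i, j] = d; // each (i, j) cell is written by exactly one task
                dur[i, j] = t;
            }
            finally { throttle.Release(); }
        }

        var tasks = Enumerable.Range(0, rowCount)
            .SelectMany(i => Enumerable.Range(0, rowCount).Select(j => FillCellAsync(i, j)));
        await Task.WhenAll(tasks);
        Console.WriteLine(dist[1, 2]); // 3
    }
}
```

Because every (i, j) pair is handled by exactly one task, the overwriting problem from the inner Parallel.ForEach disappears as well.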
I have the following code; what it does I don't believe is important, but I'm getting strange behavior.
When I run just the months on separate threads (as written below), it runs fine, but when I multi-thread the years (uncomment the tasks), it times out every time. The timeout is set to 5 minutes for months / 20 minutes for years, and it will time out within a minute.
Is there a known reason for this behavior? Am I missing something simple?
public List<PotentialBillingYearItem> GeneratePotentialBillingByYear()
{
var years = new List<PotentialBillingYearItem>();
//var tasks = new List<Task>();
var startYear = new DateTime(DateTime.Today.Year - 10, 1, 1);
var range = new DateRange(startYear, DateTime.Today.LastDayOfMonth());
for (var i = range.Start; i <= range.End; i = i.AddYears(1))
{
var yearDate = i;
//tasks.Add(Task.Run(() =>
//{
years.Add(new PotentialBillingYearItem
{
Total = GeneratePotentialBillingMonths(new PotentialBillingParameters { Year = yearDate.Year }).Average(s => s.Total),
Date = yearDate
});
//}));
}
//Task.WaitAll(tasks.ToArray(), TimeSpan.FromMinutes(20));
return years;
}
public List<PotentialBillingItem> GeneratePotentialBillingMonths(PotentialBillingParameters Parameters)
{
var items = new List<PotentialBillingItem>();
var tasks = new List<Task>();
var year = new DateTime(Parameters.Year, 1, 1);
var range = new DateRange(year, year.LastDayOfYear());
range.Start = range.Start == range.End ? DateTime.Now.FirstDayOfYear() : range.Start.FirstDayOfMonth();
if (range.End > DateTime.Today) range.End = DateTime.Today.LastDayOfMonth();
for (var i = range.Start; i <= range.End; i = i.AddMonths(1))
{
var firstDayOfMonth = i;
var lastDayOfMonth = i.LastDayOfMonth();
var monthRange = new DateRange(firstDayOfMonth, lastDayOfMonth);
tasks.Add(Task.Run(() =>
{
using (var db = new AlbionConnection())
{
var invoices = GetInvoices(lastDayOfMonth);
var timeslipSets = GetTimeslipSets();
var item = new PotentialBillingItem
{
Date = firstDayOfMonth,
PostedInvoices = CalculateInvoiceTotals(invoices.Where(w => w.post_date <= lastDayOfMonth), monthRange),
UnpostedInvoices = CalculateInvoiceTotals(invoices.Where(w => w.post_date == null || w.post_date > lastDayOfMonth), monthRange),
OutstandingDrafts = CalculateOutstandingDraftTotals(timeslipSets)
};
items.Add(item);
}
}));
}
Task.WaitAll(tasks.ToArray(), TimeSpan.FromMinutes(5));
return items;
}
You might consider pre-allocating a bigger number of threadpool threads. The threadpool is very slow to allocate new threads. The code below takes only 10 seconds (the theoretical minimum) to run when the minimum number of threadpool threads is set to 2.5k, but commenting out the SetMinThreads makes it take over 1 minute 30 seconds.
static void Main(string[] args)
{
ThreadPool.SetMinThreads(2500, 10);
Stopwatch sw = Stopwatch.StartNew();
RunTasksOutter(10);
sw.Stop();
Console.WriteLine($"Finished in {sw.Elapsed}");
}
public static void RunTasksOutter(int num) => Task.WaitAll(Enumerable.Range(0, num).Select(x => Task.Run(() => RunTasksInner(10))).ToArray());
public static void RunTasksInner(int num) => Task.WaitAll(Enumerable.Range(0, num).Select(x => Task.Run(() => Thread.Sleep(10000))).ToArray());
You could also be running out of threadpool threads. Per https://msdn.microsoft.com/en-us/library/0ka9477y(v=vs.110).aspx, one of the cases where you should not use the threadpool (which is used by tasks) is:
You have tasks that cause the thread to block for long periods of time. The thread pool has a maximum number of threads, so a large number of blocked thread pool threads might prevent tasks from starting.
Since IO is being done on these threads, maybe consider replacing them with async code or starting them with the LongRunning option? https://msdn.microsoft.com/en-us/library/system.threading.tasks.taskcreationoptions(v=vs.110).aspx
Hi, I am spidering a site and reading its contents. I want to keep the request rate reasonable - up to approximately 10 requests per second should probably be OK. Currently it is 5k requests per minute, and it is causing security issues because it looks like bot activity.
How do I do this? Here is my code:
protected void Iterareitems(List<Item> items)
{
foreach (var item in items)
{
GetImagesfromItem(item);
if (item.HasChildren)
{
Iterareitems(item.Children.ToList());
}
}
}
protected void GetImagesfromItem(Item childitems)
{
var document = new HtmlWeb().Load(completeurl);
var urls = document.DocumentNode.Descendants("img")
.Select(e => e.GetAttributeValue("src", null))
.Where(s => !string.IsNullOrEmpty(s)).ToList();
}
You need System.Threading.Semaphore, with which you can control the maximum number of concurrent threads/tasks. Here is an example:
var maxThreads = 3;
var semaphore = new Semaphore(maxThreads, maxThreads);
for (int i = 0; i < 10; i++) //10 tasks in total
{
var j = i;
Task.Factory.StartNew(() =>
{
semaphore.WaitOne();
Console.WriteLine("start " + j.ToString());
Thread.Sleep(1000);
Console.WriteLine("end " + j.ToString());
semaphore.Release();
});
}
You can see that at most 3 tasks are working; the others are held pending by semaphore.WaitOne() because the maximum limit has been reached, and a pending thread continues once another thread releases the semaphore with semaphore.Release().
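One caveat: the semaphore above caps how many requests run at once, while the question asks for a request *rate* (roughly 10 per second). One way to cap the rate instead is to hand each permit back a fixed window after it was taken, regardless of when the work finishes. A self-contained sketch (simulated downloads; the numbers would be tuned for the real crawler):

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    // At most 'permits' task starts per 'window': each permit is handed back
    // one window after it was taken, independent of when the work finishes.
    static async Task RateLimitedAsync(SemaphoreSlim limiter, TimeSpan window, Func<Task> work)
    {
        await limiter.WaitAsync();
        _ = Task.Delay(window).ContinueWith(t => limiter.Release());
        await work();
    }

    static async Task Main()
    {
        var limiter = new SemaphoreSlim(10, 10); // ~10 request starts per second
        var window = TimeSpan.FromSeconds(1);
        var sw = Stopwatch.StartNew();

        var tasks = Enumerable.Range(0, 25)
            .Select(i => RateLimitedAsync(limiter, window, () => Task.Delay(100))); // simulated page download
        await Task.WhenAll(tasks);

        // 25 requests at <= 10 starts per second: the last 5 cannot start before ~2s.
        Console.WriteLine(sw.Elapsed >= TimeSpan.FromSeconds(2)); // True
    }
}
```

For production crawling, a dedicated rate limiter (or a polite crawl delay per robots.txt) would be more robust than this sketch, but the permit-per-window idea is the core of most token-bucket style limiters.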