Processing a large number of tasks concurrently and asynchronously - C#

I would like to process a list of 50,000 URLs through a web service. The provider of this service allows 5 connections per second.
I need to process these URLs in parallel while adhering to the provider's rules.
This is my current code:
static void Main(string[] args)
{
    process_urls().GetAwaiter().GetResult();
}
public static async Task process_urls()
{
    // let's say there is a list of 50,000+ URLs
    var urls = System.IO.File.ReadAllLines("urls.txt");
    var allTasks = new List<Task>();
    var throttler = new SemaphoreSlim(initialCount: 5);
    foreach (var url in urls)
    {
        await throttler.WaitAsync();
        allTasks.Add(
            Task.Run(async () =>
            {
                try
                {
                    Console.WriteLine(String.Format("Starting {0}", url));
                    var client = new HttpClient();
                    var xml = await client.GetStringAsync(url);
                    //do some processing on xml output
                    client.Dispose();
                }
                finally
                {
                    throttler.Release();
                }
            }));
    }
    await Task.WhenAll(allTasks);
}
Instead of var client = new HttpClient(); I will create a new object of the target web service, but this is just to make the code generic.
Is this the correct approach to handle and process a huge list of connections? And is there any way I can limit the number of established connections to 5 per second, as the current implementation does not consider any time frame?
Thanks

Reading values from a web service is an IO operation, which can be done asynchronously without multithreading.
In this case the threads do nothing but wait for a response, so running the requests on parallel threads just wastes resources.
public static async Task process_urls()
{
    var urls = System.IO.File.ReadAllLines("urls.txt");
    foreach (var urlGroup in SplitToGroupsOfFive(urls))
    {
        var tasks = new List<Task>();
        foreach (var url in urlGroup)
        {
            var task = ProcessUrl(url);
            tasks.Add(task);
        }
        // This delay ensures that the next 5 urls are started no sooner than 1 second later
        tasks.Add(Task.Delay(1000));
        await Task.WhenAll(tasks.ToArray());
    }
}
private static async Task ProcessUrl(string url)
{
    using (var client = new HttpClient())
    {
        var xml = await client.GetStringAsync(url);
        //do some processing on xml output
    }
}
private static IEnumerable<IEnumerable<string>> SplitToGroupsOfFive(IEnumerable<string> urls)
{
    const int GROUP_SIZE = 5;
    string[] group = null;
    var count = 0;
    foreach (var url in urls)
    {
        if (group == null)
            group = new string[GROUP_SIZE];
        group[count] = url;
        count++;
        if (count < GROUP_SIZE)
            continue;
        yield return group;
        group = null;
        count = 0;
    }
    // Return the last, partially filled group (only the slots actually used)
    if (group != null && count > 0)
    {
        yield return group.Take(count);
    }
}
Because you mention that the "processing" of the response is also an IO operation, the async/await approach is the most efficient: no thread is blocked while waiting, and other tasks are processed while earlier ones wait for a response from the web service or for file-writing IO operations.
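The batching approach above waits for every request in a group to finish before the next group starts. If the provider's limit applies only to how many requests may start per second (an assumption, since the question does not say), a variant can let the clock alone gate each batch. This is a minimal sketch under that assumption: RunRateLimitedAsync and the dummy workload are illustrative names, not part of the original answer, and Enumerable.Chunk requires .NET 6 or later.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

// Starts at most `perWindow` operations per `window`. Earlier operations may
// still be running when the next window opens; only the clock gates each batch.
static async Task RunRateLimitedAsync<T>(
    IEnumerable<T> items, Func<T, Task> work, int perWindow, TimeSpan window)
{
    var running = new List<Task>();
    foreach (var batch in items.Chunk(perWindow))
    {
        var windowElapsed = Task.Delay(window);
        running.AddRange(batch.Select(work)); // AddRange enumerates, so the batch starts here
        await windowElapsed;                  // wait out the rest of the window
    }
    await Task.WhenAll(running);              // surface any failures at the end
}

// Dummy workload standing in for the web service call (hypothetical).
var starts = new ConcurrentBag<long>();
var clock = Stopwatch.StartNew();

await RunRateLimitedAsync(
    Enumerable.Range(1, 12),
    async n =>
    {
        starts.Add(clock.ElapsedMilliseconds); // record when this item started
        await Task.Delay(50);                  // simulate the request
    },
    perWindow: 5,
    window: TimeSpan.FromMilliseconds(200));

Console.WriteLine($"Started {starts.Count} operations in {clock.ElapsedMilliseconds} ms");
```

For the real service the call would use perWindow: 5 and window: TimeSpan.FromSeconds(1). If the provider also caps how many connections may be open at once, a SemaphoreSlim inside the work delegate can bound that separately.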

To Implement watch operation on multiple mongo databases in c# using changestreams

var options = new ChangeStreamOptions {
    FullDocument = ChangeStreamFullDocumentOption.UpdateLookup
};
var enumerator = newclient
    .GetDatabase(databaseName)
    .Watch(options)
    .ToEnumerable()
    .GetEnumerator();
enumerator.MoveNext();
The code above watches a single database; I need to watch for updates in multiple databases in parallel.
I would suggest using channels for this, as they are designed for synchronising parallel sources for processing.
It would look something like this:
using MongoDB.Bson;
using MongoDB.Driver;
using System.Threading.Channels;
var dblist = new string[]{
"db1",
"db2",
"db3"
};
var options = new ChangeStreamOptions
{
FullDocument = ChangeStreamFullDocumentOption.UpdateLookup
};
var client = new MongoClient();
var watcher = Channel.CreateUnbounded<ChangeStreamDocument<BsonDocument>>();
var tokenSource = new CancellationTokenSource();
List<Task> tasks = new List<Task>();
foreach (var db in dblist)
{
//create a monitor for each database
tasks.Add(
monitorDB(db,client, watcher.Writer, tokenSource.Token)
);
}
//create the processor for your changes
var processor = processChanges(watcher.Reader);
//wait for all the monitors to complete
await Task.WhenAll(tasks);
// when all monitors have ended mark the channel as completed
watcher.Writer.Complete();
//wait for the processor to complete
await processor;
async Task monitorDB(string name, IMongoClient client, ChannelWriter<ChangeStreamDocument<BsonDocument>> writer, CancellationToken token)
{
    // WatchAsync/MoveNextAsync keep the wait asynchronous, so one monitor does not block the others
    using var cursor = await client
        .GetDatabase(name)
        .WatchAsync(options, token);
    // MoveNextAsync completes when the next batch of changes arrives
    while (await cursor.MoveNextAsync(token))
    {
        foreach (var change in cursor.Current)
        {
            // write the change to the channel
            await writer.WriteAsync(change, token);
        }
        //if cancelled exit
        if (token.IsCancellationRequested)
            break;
    }
}
async Task processChanges(ChannelReader<ChangeStreamDocument<BsonDocument>> reader)
{
//while the channel isn't complete, wait for a new document to be received
while(await reader.WaitToReadAsync())
{
//get the waiting document
ChangeStreamDocument<BsonDocument> doc = await reader.ReadAsync();
//do something with doc ...
}
}
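The fan-in pattern in this answer (several producers, one channel, one consumer) can be tried without a MongoDB server. The following is a minimal sketch in which the producers merely simulate change events; ProduceAsync, ConsumeAsync and the "db1".."db3" labels are illustrative stand-ins, not driver APIs.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Channels;
using System.Threading.Tasks;

var channel = Channel.CreateUnbounded<string>();

// One producer per simulated source; each writes into the shared channel.
static async Task ProduceAsync(string source, int count, ChannelWriter<string> writer)
{
    for (var i = 0; i < count; i++)
    {
        await Task.Delay(10);                      // stand-in for waiting on a change event
        await writer.WriteAsync($"{source}:{i}");
    }
}

// Single consumer: ReadAllAsync ends once the writer is completed and drained.
static async Task<List<string>> ConsumeAsync(ChannelReader<string> reader)
{
    var seen = new List<string>();
    await foreach (var item in reader.ReadAllAsync())
        seen.Add(item);
    return seen;
}

// Start all producers and the consumer before awaiting anything.
var producers = new[] { "db1", "db2", "db3" }
    .Select(name => ProduceAsync(name, 4, channel.Writer))
    .ToArray();
var consumer = ConsumeAsync(channel.Reader);

await Task.WhenAll(producers);
channel.Writer.Complete();                         // no more writes: lets the consumer finish
var received = await consumer;

Console.WriteLine($"Received {received.Count} items"); // prints "Received 12 items"
```

Items from different producers may interleave in any order; only the order within a single producer is preserved.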

Why is using tasks with HttpClient synchronously so much slower?

So I was trying to do a quick performance test against a web API to see how it would handle multiple synchronous HTTP requests at once. I did this by spinning up 30 tasks and having each of them send an HTTP request with the HttpClient. To my surprise, it was extremely slow. I thought it was due to the lack of async/await or that the web API was slow, but it turns out it only happens when I'm using tasks and synchronous HTTP calls (see TestSynchronousWithParallelTasks() below).
So I compared running without tasks, async/await with tasks, and Parallel.ForEach by writing some simple tests. All of these finished quickly, in around 10-20 milliseconds, except the original case, which takes around 20 seconds!
Class: HttpClientTest - Passed (5) - 19.2 sec
TestProject.HttpClientTest.TestAsyncWithParallelTasks - Passed - 12 ms
TestProject.HttpClientTest.TestIterativeAndSynchronous - Passed - 22 ms
TestProject.HttpClientTest.TestParallelForEach - Passed - 15 ms
TestProject.HttpClientTest.TestSynchronousWithParallelTasks - Passed - 19.1 sec
TestProject.HttpClientTest.TestSynchronousWithParallelThreads - Passed - 10 ms
public class HttpClientTest
{
private HttpClient httpClient;
private readonly ITestOutputHelper _testOutputHelper;
public HttpClientTest(ITestOutputHelper testOutputHelper)
{
_testOutputHelper = testOutputHelper;
ServicePointManager.DefaultConnectionLimit = 100;
httpClient = new HttpClient(new HttpClientHandler { MaxConnectionsPerServer = 100 });
}
[Fact]
public async Task TestSynchronousWithParallelTasks()
{
var tasks = new List<Task>();
var url = "https://localhost:44388/api/values";
for (var i = 0; i < 30; i++)
{
var task = Task.Run(() =>
{
var response = httpClient.GetAsync(url).Result;
var content = response.Content.ReadAsStringAsync().Result;
});
tasks.Add(task);
}
await Task.WhenAll(tasks);
}
[Fact]
public void TestIterativeAndSynchronous()
{
var url = "https://localhost:44388/api/values";
for (var i = 0; i < 30; i++)
{
var response = httpClient.GetAsync(url).Result;
var content = response.Content.ReadAsStringAsync().Result;
}
}
[Fact]
public async Task TestAsyncWithParallelTasks()
{
var url = "https://localhost:44388/api/values";
var tasks = new List<Task>();
for (var i = 0; i < 30; i++)
{
var task = Task.Run(async () =>
{
var response = await httpClient.GetAsync(url);
var content = await response.Content.ReadAsStringAsync();
});
tasks.Add(task);
}
await Task.WhenAll(tasks);
}
[Fact]
public void TestParallelForEach()
{
var url = "https://localhost:44388/api/values";
var n = new int[30];
Parallel.ForEach(n, new ParallelOptions { MaxDegreeOfParallelism = 2 }, (i) =>
{
var response = httpClient.GetAsync(url).Result;
var content = response.Content.ReadAsStringAsync().Result;
});
}
[Fact]
public async Task TestSynchronousWithParallelThreads()
{
var tasks = new List<Task>();
var url = "https://localhost:44388/api/values";
var threads = new List<Thread>();
for (var i = 0; i < 30; i++)
{
var thread = new Thread( () =>
{
var response = httpClient.GetAsync(url).Result;
var content = response.Content.ReadAsStringAsync().Result;
});
thread.Start();
threads.Add(thread);
}
foreach(var thread in threads)
{
thread.Join();
}
}
}
So any idea what's causing this performance hit?
I would have expected TestSynchronousWithParallelTasks() to be faster than TestIterativeAndSynchronous(), as you'd be starting more requests at once even if it's IO bound, while the latter waits for each request before starting a new one. So it seems like it's related to the tasks somehow blocking each other?
Edit: Added a test case to use threads instead and it's quick like the rest.

Is my approach correct for concurrent network requests?

I wrote a web crawler and I want to know if my approach is correct. The only issue I'm facing is that it stops after some hours of crawling. No exception, it just stops.
1 - the private members and the constructor:
private const int CONCURRENT_CONNECTIONS = 5;
private readonly HttpClient _client;
private readonly string[] _services = new string[2] {
"https://example.com/items?id=ID_HERE",
"https://another_example.com/items?id=ID_HERE"
};
private readonly List<SemaphoreSlim> _semaphores;
public Crawler() {
ServicePointManager.DefaultConnectionLimit = CONCURRENT_CONNECTIONS;
_client = new HttpClient();
_semaphores = new List<SemaphoreSlim>();
foreach (var _ in _services) {
_semaphores.Add(new SemaphoreSlim(CONCURRENT_CONNECTIONS));
}
}
Single HttpClient instance.
The _services is just a string array that contains the URL, they are not the same domain.
I'm using semaphores (one per domain) since I read that it's not a good idea to use the network queue (I don't remember what it is called).
2 - The Run method, which is the one I will call to start crawling.
public async Task Run(List<int> ids) {
const int BATCH_COUNT = 1000;
var svcIndex = 0;
var tasks = new List<Task<string>>(BATCH_COUNT);
foreach (var itemId in ids) {
tasks.Add(DownloadItem(svcIndex, _services[svcIndex].Replace("ID_HERE", $"{itemId}")));
if (++svcIndex >= _services.Length) {
svcIndex = 0;
}
if (tasks.Count >= BATCH_COUNT) {
var results = await Task.WhenAll(tasks);
await SaveDownloadedData(results);
tasks.Clear();
}
}
if (tasks.Count > 0) {
var results = await Task.WhenAll(tasks);
await SaveDownloadedData(results);
tasks.Clear();
}
}
DownloadItem is an async function that actually makes the GET request, note that I'm not awaiting it here.
If the number of tasks reaches the BATCH_COUNT, I will await all to complete and save the results to file.
3 - The DownloadItem function.
private async Task<string> DownloadItem(int serviceIndex, string link) {
var needReleaseSemaphore = true;
var result = string.Empty;
try {
await _semaphores[serviceIndex].WaitAsync();
var r = await _client.GetStringAsync(link);
_semaphores[serviceIndex].Release();
needReleaseSemaphore = false;
// DUE TO JSON SIZE, I NEED TO REMOVE A VALUE (IT'S USELESS FOR ME)
var obj = JObject.Parse(r);
if (obj.ContainsKey("blah")) {
obj.Remove("blah");
}
result = obj.ToString(Formatting.None);
} catch {
result = string.Empty;
// SINCE I GOT AN EXCEPTION, I WILL 'LOCK' THIS SERVICE FOR 1 MINUTE.
// IF I RELEASED THIS SEMAPHORE, I WILL LOCK IT AGAIN FIRST.
if (!needReleaseSemaphore) {
await _semaphores[serviceIndex].WaitAsync();
needReleaseSemaphore = true;
}
await Task.Delay(60_000);
} finally {
// RELEASE THE SEMAPHORE, IF NEEDED.
if (needReleaseSemaphore) {
_semaphores[serviceIndex].Release();
}
}
return result;
}
4- The function that saves the result.
private async Task SaveDownloadedData(IEnumerable<string> myData) {
using var fs = new FileStream("./output.dat", FileMode.Append);
foreach (var res in myData) {
var blob = Encoding.UTF8.GetBytes(res);
await fs.WriteAsync(BitConverter.GetBytes((uint)blob.Length));
await fs.WriteAsync(blob);
}
await fs.DisposeAsync();
}
5- Finally, the Main function.
static async Task Main(string[] args) {
var crawler = new Crawler();
var items = LoadItemIds();
await crawler.Run(items);
}
After all this, is my approach correct? I need to make millions of requests, will take some weeks/months to gather all data I need (due to the connection limit).
After 12 - 14 hours, it just stops and I need to manually restart the app (memory usage is ok, my VPS has 1 GB and it never used more than 60%).

Send multiple requests at once to my WebAPI using Task.WhenAll

I'm trying to send multiple identical requests at (almost) the same time to my WebAPI to do some performance testing.
For this, I am calling PerformRequest multiple times and waiting for them using await Task.WhenAll.
I want to record the start time of each request and the time it takes to complete. In my code, however, I don't know what happens if the result of R3 (request number 3) arrives before R1's. Would the duration be wrong?
From what I see in the results, I think the results are getting mixed up with each other. For example, R4's result is set as R1's result. So any help would be appreciated.
GlobalStopWatcher is a static class that I'm using to find the start time of each request.
Basically I want to make sure that the elapsedMilliseconds and Duration of each request are associated with the request itself.
So if the result of the 10th request arrives before the result of the 1st request, then the duration would be duration = elapsedTime(10th)-(startTime(1st)). Isn't that the case?
I wanted to add a lock, but it seems impossible to add one where there's an await keyword.
public async Task<RequestResult> PerformRequest(RequestPayload requestPayload)
{
var url = "myUrl.com";
var client = new RestClient(url) { Timeout = -1 };
var request = new RestRequest { Method = Method.POST };
request.AddHeaders(requestPayload.Headers);
foreach (var cookie in requestPayload.Cookies)
{
request.AddCookie(cookie.Key, cookie.Value);
}
request.AddJsonBody(requestPayload.BodyRequest);
var st = new Stopwatch();
st.Start();
var elapsedMilliseconds = GlobalStopWatcher.Stopwatch.ElapsedMilliseconds;
var result = await client.ExecuteAsync(request).ConfigureAwait(false);
st.Stop();
var duration = st.ElapsedMilliseconds;
return new RequestResult()
{
Millisecond = elapsedMilliseconds,
Content = result.Content,
Duration = duration
};
}
public async Task RunAllTasks(int numberOfRequests)
{
GlobalStopWatcher.Stopwatch.Start();
var arrTasks = new Task<RequestResult>[numberOfRequests];
for (var i = 0; i < numberOfRequests; i++)
{
arrTasks[i] = _requestService.PerformRequest(requestPayload, false);
}
var results = await Task.WhenAll(arrTasks).ConfigureAwait(false);
RequestsFinished?.Invoke(this, results.ToList());
}
Where I think you're going wrong with this is trying to use a static GlobalStopWatcher and then pushing this code into your function that you're testing.
You should keep everything separate and use a new instance of Stopwatch for each RunAllTasks call.
Let's make it so.
Start with these:
public async Task<RequestResult<R>> ExecuteAsync<R>(Stopwatch global, Func<Task<R>> process)
{
var s = global.ElapsedMilliseconds;
var c = await process();
var d = global.ElapsedMilliseconds - s;
return new RequestResult<R>()
{
Content = c,
Millisecond = s,
Duration = d
};
}
public class RequestResult<R>
{
public R Content;
public long Millisecond;
public long Duration;
}
Now you're in a position to test anything that fits the signature of Func<Task<R>>.
Let's try this:
public async Task<int> DummyAsync(int x)
{
await Task.Delay(TimeSpan.FromSeconds(x % 3));
return x;
}
We can set up a test like this:
public async Task<RequestResult<int>[]> RunAllTasks(int numberOfRequests)
{
var sw = Stopwatch.StartNew();
var tasks =
from i in Enumerable.Range(0, numberOfRequests)
select ExecuteAsync<int>(sw, () => DummyAsync(i));
return await Task.WhenAll(tasks).ConfigureAwait(false);
}
Note that the line var sw = Stopwatch.StartNew(); captures a new Stopwatch for each RunAllTasks call. Nothing is actually "global" anymore.
If I execute that with RunAllTasks(7) then I get this result:
It runs and it counts correctly.
Now you can just refactor your PerformRequest method to just do what it needs to:
public async Task<string> PerformRequest(RequestPayload requestPayload)
{
var url = "myUrl.com";
var client = new RestClient(url) { Timeout = -1 };
var request = new RestRequest { Method = Method.POST };
request.AddHeaders(requestPayload.Headers);
foreach (var cookie in requestPayload.Cookies)
{
request.AddCookie(cookie.Key, cookie.Value);
}
request.AddJsonBody(requestPayload.BodyRequest);
var response = await client.ExecuteAsync(request);
return response.Content;
}
Running the tests is easy:
public async Task<RequestResult<string>[]> RunAllTasks(int numberOfRequests)
{
var sw = Stopwatch.StartNew();
var tasks =
from i in Enumerable.Range(0, numberOfRequests)
select ExecuteAsync<string>(sw, () => _requestService.PerformRequest(requestPayload));
return await Task.WhenAll(tasks).ConfigureAwait(false);
}
If there's any doubt about the thread-safety of Stopwatch then you could do this:
public async Task<RequestResult<R>> ExecuteAsync<R>(Func<long> getMilliseconds, Func<Task<R>> process)
{
var s = getMilliseconds();
var c = await process();
var d = getMilliseconds() - s;
return new RequestResult<R>()
{
Content = c,
Millisecond = s,
Duration = d
};
}
public async Task<RequestResult<int>[]> RunAllTasks(int numberOfRequests)
{
var sw = Stopwatch.StartNew();
var tasks =
from i in Enumerable.Range(0, numberOfRequests)
select ExecuteAsync<int>(() => { lock (sw) { return sw.ElapsedMilliseconds; } }, () => DummyAsync(i));
return await Task.WhenAll(tasks).ConfigureAwait(false);
}

How to use HttpClient PostAsync() with threadpool in C#?

I'm using the following code to post an image to a server.
var image = Image.FromFile(@"C:\Image.jpg");
Task<string> upload = Upload(image);
upload.Wait();
public static async Task<string> Upload(Image image)
{
var uriBuilder = new UriBuilder
{
Host = "somewhere.net",
Path = "/path/",
Port = 443,
Scheme = "https",
Query = "process=false"
};
using (var client = new HttpClient())
{
client.DefaultRequestHeaders.Add("locale", "en_US");
client.DefaultRequestHeaders.Add("country", "US");
var content = ConvertToHttpContent(image);
content.Headers.ContentType = MediaTypeHeaderValue.Parse("image/jpeg");
using (var mpcontent = new MultipartFormDataContent("--myFakeDividerText--")
{
{content, "fakeImage", "myFakeImageName.jpg"}
}
)
{
using (
var message = await client.PostAsync(uriBuilder.Uri, mpcontent))
{
var input = await message.Content.ReadAsStringAsync();
return "nothing for now";
}
}
}
}
I'd like to modify this code to run multiple threads. I've used "ThreadPool.QueueUserWorkItem" before and started to modify the code to leverage it.
private void UseThreadPool()
{
int minWorker, minIOC;
ThreadPool.GetMinThreads(out minWorker, out minIOC);
ThreadPool.SetMinThreads(1, minIOC);
int maxWorker, maxIOC;
ThreadPool.GetMaxThreads(out maxWorker, out maxIOC);
ThreadPool.SetMinThreads(4, maxIOC);
var events = new List<ManualResetEvent>();
foreach (var image in ImageCollection)
{
var resetEvent = new ManualResetEvent(false);
ThreadPool.QueueUserWorkItem(
arg =>
{
var img = Image.FromFile(image.getPath());
Task<string> upload = Upload(img);
upload.Wait();
resetEvent.Set();
});
events.Add(resetEvent);
if (events.Count <= 0) continue;
foreach (ManualResetEvent e in events) e.WaitOne();
}
}
The problem is that only one thread executes at a time due to the call to "upload.Wait()". So I'm still executing each thread in sequence. It's not clear to me how I can use PostAsync with a thread-pool.
How can I post images to a server using multiple threads by tweaking the code above? Is HttpClient PostAsync the best way to do this?
"I'd like to modify this code to run multiple threads."
Why? The thread pool should only be used for CPU-bound work (and I/O completions, of course).
You can do concurrency just fine with async:
var tasks = ImageCollection.Select(image =>
{
var img = Image.FromFile(image.getPath());
return Upload(img);
});
await Task.WhenAll(tasks);
Note that I removed your Wait. You should avoid using Wait or Result with async tasks; use await instead. Yes, this will cause async to grow through your code, and you should use async "all the way".
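To make "async all the way" concrete, here is a minimal, self-contained sketch of the same shape with no blocking Wait anywhere. UploadAsync is a hypothetical stand-in for the Upload method above (it only simulates the I/O with a delay), and the file names are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical stand-in for Upload(image): simulates an I/O-bound call.
static async Task<string> UploadAsync(string path)
{
    await Task.Delay(100);
    return $"uploaded {path}";
}

var paths = new[] { "a.jpg", "b.jpg", "c.jpg" };

// Start every upload, then await them together; no thread blocks while waiting.
// Task.WhenAll returns the results in the same order as the input tasks.
var results = await Task.WhenAll(paths.Select(UploadAsync));

foreach (var line in results)
    Console.WriteLine(line);
```

Because the uploads overlap, the whole run takes roughly as long as one simulated upload rather than three.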
