I am trying to refactor my code below to improve its performance. I have noticed that the calls in the foreach run one after another instead of in parallel, which slows down the UpdateSites function. I just need UpdateSites to run in the background. What can I do to improve its performance?
await UpdateSites(currentUserSites);
private async Task<List<Site>> UpdateSites(List<Site> sites)
{
foreach (var site in sites)
{
var newSite = await FetchSite(site.SiteId);
site.address = newSite.address;
site.phone = newSite.phone;
}
return sites;
}
public async Task<SiteSimple> FetchSite(int siteId)
{
var url = $"/site/{siteId}/simple";
return await ExecuteRestRequest<SiteSimple>(url, Method.GET);
}
I refactored your code according to your needs.
private List<Task<SiteSimple>> UpdateSites(List<Site> sites)
{
return sites.Select(x => FetchSite(x.SiteId)).ToList();
}
public Task<SiteSimple> FetchSite(int siteId)
{
var url = $"/site/{siteId}/simple";
return ExecuteRestRequest<SiteSimple>(url, Method.GET);
}
Then call it like this:
var resultList = await Task.WhenAll(UpdateSites(currentUserSites));
foreach (var item in resultList)
{
// do your operation on each result
}
}
You may want to use Task.WhenAll to run all the fetch tasks in parallel:
private async Task<List<Site>> UpdateSites(List<Site> sites)
{
var newSites = await Task.WhenAll(sites.Select(site => FetchSite(site.SiteId)));
for (int i = 0; i < sites.Count; ++i)
{
sites[i].address = newSites[i].address;
sites[i].phone = newSites[i].phone;
}
}
return sites;
}
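Note that Task.WhenAll starts every request at once, which can overwhelm the API for large lists. A minimal sketch of a throttled variant, assuming a cap of 10 concurrent calls (the number is an assumption, not part of the original), using a SemaphoreSlim:
private async Task<List<Site>> UpdateSitesThrottled(List<Site> sites)
{
    // The cap of 10 concurrent requests is an assumption; tune it to the API.
    using var throttler = new SemaphoreSlim(10);
    var newSites = await Task.WhenAll(sites.Select(async site =>
    {
        await throttler.WaitAsync();
        try { return await FetchSite(site.SiteId); }
        finally { throttler.Release(); }
    }));
    for (int i = 0; i < sites.Count; ++i)
    {
        sites[i].address = newSites[i].address;
        sites[i].phone = newSites[i].phone;
    }
    return sites;
}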
I wrote a web crawler and I want to know if my approach is correct. The only issue I'm facing is that it stops after some hours of crawling. No exception, it just stops.
1 - the private members and the constructor:
private const int CONCURRENT_CONNECTIONS = 5;
private readonly HttpClient _client;
private readonly string[] _services = new string[2] {
"https://example.com/items?id=ID_HERE",
"https://another_example.com/items?id=ID_HERE"
};
private readonly List<SemaphoreSlim> _semaphores;
public Crawler() {
ServicePointManager.DefaultConnectionLimit = CONCURRENT_CONNECTIONS;
_client = new HttpClient();
_semaphores = new List<SemaphoreSlim>();
foreach (var _ in _services) {
_semaphores.Add(new SemaphoreSlim(CONCURRENT_CONNECTIONS));
}
}
Single HttpClient instance.
The _services field is just a string array containing the URLs; they are not on the same domain.
I'm using semaphores (one per domain) since I read that it's not a good idea to rely on the network connection queue (I don't remember what it's called).
2 - The Run method, which is the one I will call to start crawling.
public async Task Run(List<int> ids) {
const int BATCH_COUNT = 1000;
var svcIndex = 0;
var tasks = new List<Task<string>>(BATCH_COUNT);
foreach (var itemId in ids) {
tasks.Add(DownloadItem(svcIndex, _services[svcIndex].Replace("ID_HERE", $"{itemId}")));
if (++svcIndex >= _services.Length) {
svcIndex = 0;
}
if (tasks.Count >= BATCH_COUNT) {
var results = await Task.WhenAll(tasks);
await SaveDownloadedData(results);
tasks.Clear();
}
}
if (tasks.Count > 0) {
var results = await Task.WhenAll(tasks);
await SaveDownloadedData(results);
tasks.Clear();
}
}
DownloadItem is an async function that actually makes the GET request; note that I'm not awaiting it here.
If the number of tasks reaches BATCH_COUNT, I await them all and save the results to a file.
3 - The DownloadItem function.
private async Task<string> DownloadItem(int serviceIndex, string link) {
var needReleaseSemaphore = true;
var result = string.Empty;
try {
await _semaphores[serviceIndex].WaitAsync();
var r = await _client.GetStringAsync(link);
_semaphores[serviceIndex].Release();
needReleaseSemaphore = false;
// DUE TO JSON SIZE, I NEED TO REMOVE A VALUE (IT'S USELESS FOR ME)
var obj = JObject.Parse(r);
if (obj.ContainsKey("blah")) {
obj.Remove("blah");
}
result = obj.ToString(Formatting.None);
} catch {
result = string.Empty;
// SINCE I GOT AN EXCEPTION, I WILL 'LOCK' THIS SERVICE FOR 1 MINUTE.
// IF I RELEASED THIS SEMAPHORE, I WILL LOCK IT AGAIN FIRST.
if (!needReleaseSemaphore) {
await _semaphores[serviceIndex].WaitAsync();
needReleaseSemaphore = true;
}
await Task.Delay(60_000);
} finally {
// RELEASE THE SEMAPHORE, IF NEEDED.
if (needReleaseSemaphore) {
_semaphores[serviceIndex].Release();
}
}
return result;
}
4 - The function that saves the result.
private async Task SaveDownloadedData(IEnumerable<string> myData) {
await using var fs = new FileStream("./output.dat", FileMode.Append);
foreach (var res in myData) {
var blob = Encoding.UTF8.GetBytes(res);
// Length-prefix each record so the file can be parsed back later.
await fs.WriteAsync(BitConverter.GetBytes((uint)blob.Length));
await fs.WriteAsync(blob);
}
}
5 - Finally, the Main function.
static async Task Main(string[] args) {
var crawler = new Crawler();
var items = LoadItemIds();
await crawler.Run(items);
}
After all this, is my approach correct? I need to make millions of requests; it will take some weeks/months to gather all the data I need (due to the connection limit).
After 12-14 hours, it just stops and I need to manually restart the app (memory usage is OK; my VPS has 1 GB and it never uses more than 60%).
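One debugging suggestion (an observation, not a confirmed fix): the bare catch in DownloadItem swallows every exception, so whatever eventually makes the crawler stop leaves no trace. Logging the failure before the one-minute lock would at least show what is going wrong:
} catch (Exception ex) {
    result = string.Empty;
    // Record which link failed and why, so a silent stop can be diagnosed.
    Console.Error.WriteLine($"{DateTime.UtcNow:O} {link} failed: {ex}");
    // ... rest of the existing catch block unchanged
}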
I've used the below code from this post - What is the best way to call API calls in parallel in .net Core, C#?
It works fine, but when I'm processing a large list, some of the calls fail.
My question is: how can I implement retry logic for this?
var tasks = new List<Task<string>>();
foreach (var post in list)
{
async Task<string> func()
{
var response = await client.GetAsync("posts/" + post);
return await response.Content.ReadAsStringAsync();
}
tasks.Add(func());
}
await Task.WhenAll(tasks);
var postResponses = new List<string>();
foreach (var t in tasks) {
var postResponse = await t; //t.Result would be okay too.
postResponses.Add(postResponse);
Console.WriteLine(postResponse);
}
This is my attempt to use Polly. It doesn't work: it still fails on around the same number of requests as before.
What am I doing wrong?
var policy = Policy
.Handle<HttpRequestException>()
.RetryAsync(3);
foreach (var mediaItem in uploadedMedia)
{
var mediaRequest = new HttpRequestMessage { *** }
async Task<string> func()
{
var response = await client.SendAsync(mediaRequest);
return await response.Content.ReadAsStringAsync();
}
tasks.Add(policy.ExecuteAsync(() => func()));
}
await Task.WhenAll(tasks);
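A likely culprit (an educated guess, since the failing responses aren't shown): HttpClient.SendAsync only throws HttpRequestException for connection-level failures, so a 429 or 500 response completes "successfully" and the Handle<HttpRequestException> policy never sees it. A second problem is that an HttpRequestMessage instance cannot be sent twice, so any retry that reuses mediaRequest will throw. Below is a sketch that also retries on non-success status codes and builds a fresh request per attempt; CreateMediaRequest is a hypothetical factory standing in for the elided request setup:
var policy = Policy
    .Handle<HttpRequestException>()
    .OrResult<HttpResponseMessage>(r => !r.IsSuccessStatusCode)
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));
var tasks = new List<Task<string>>();
foreach (var mediaItem in uploadedMedia)
{
    async Task<string> func()
    {
        // Each attempt builds a new HttpRequestMessage; reusing one across
        // retries throws InvalidOperationException.
        var response = await policy.ExecuteAsync(() =>
            client.SendAsync(CreateMediaRequest(mediaItem)));
        return await response.Content.ReadAsStringAsync();
    }
    tasks.Add(func());
}
await Task.WhenAll(tasks);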
I'm trying to send multiple identical requests (almost) at once to my WebAPI to do some performance testing.
For this, I am calling PerformRequest multiple times and waiting for them using await Task.WhenAll.
I want to calculate the time each request takes to complete, plus the start time of each one. In my code, however, I don't know what happens if the result of R3 (request number 3) comes back before R1's. Would the duration be wrong?
From what I see in the results, I think the results are mixing with each other. For example, R4's result shows up as R1's result. So any help would be appreciated.
GlobalStopWatcher is a static class that I'm using to find the start time of each request.
Basically, I want to make sure that the elapsedMilliseconds and Duration of each request are associated with the request itself.
Otherwise, if the result of the 10th request comes before the result of the 1st, the duration might come out as duration = elapsedTime(10th) - startTime(1st). Isn't that the case?
I wanted to add a lock, but a lock statement can't contain an await.
public async Task<RequestResult> PerformRequest(RequestPayload requestPayload)
{
var url = "myUrl.com";
var client = new RestClient(url) { Timeout = -1 };
var request = new RestRequest { Method = Method.POST };
request.AddHeaders(requestPayload.Headers);
foreach (var cookie in requestPayload.Cookies)
{
request.AddCookie(cookie.Key, cookie.Value);
}
request.AddJsonBody(requestPayload.BodyRequest);
var st = new Stopwatch();
st.Start();
var elapsedMilliseconds = GlobalStopWatcher.Stopwatch.ElapsedMilliseconds;
var result = await client.ExecuteAsync(request).ConfigureAwait(false);
st.Stop();
var duration = st.ElapsedMilliseconds;
return new RequestResult()
{
Millisecond = elapsedMilliseconds,
Content = result.Content,
Duration = duration
};
}
public async Task RunAllTasks(int numberOfRequests)
{
GlobalStopWatcher.Stopwatch.Start();
var arrTasks = new Task<RequestResult>[numberOfRequests];
for (var i = 0; i < numberOfRequests; i++)
{
arrTasks[i] = _requestService.PerformRequest(requestPayload);
}
var results = await Task.WhenAll(arrTasks).ConfigureAwait(false);
RequestsFinished?.Invoke(this, results.ToList());
}
Where I think you're going wrong is in using a static GlobalStopWatcher and pushing the timing code into the very function you're testing.
You should keep everything separate and use a new instance of Stopwatch for each RunAllTasks call.
Let's make it so.
Start with these:
public async Task<RequestResult<R>> ExecuteAsync<R>(Stopwatch global, Func<Task<R>> process)
{
var s = global.ElapsedMilliseconds;
var c = await process();
var d = global.ElapsedMilliseconds - s;
return new RequestResult<R>()
{
Content = c,
Millisecond = s,
Duration = d
};
}
public class RequestResult<R>
{
public R Content;
public long Millisecond;
public long Duration;
}
Now you're in a position to test anything that fits the signature of Func<Task<R>>.
Let's try this:
public async Task<int> DummyAsync(int x)
{
await Task.Delay(TimeSpan.FromSeconds(x % 3));
return x;
}
We can set up a test like this:
public async Task<RequestResult<int>[]> RunAllTasks(int numberOfRequests)
{
var sw = Stopwatch.StartNew();
var tasks =
from i in Enumerable.Range(0, numberOfRequests)
select ExecuteAsync<int>(sw, () => DummyAsync(i));
return await Task.WhenAll(tasks).ConfigureAwait(false);
}
Note that the line var sw = Stopwatch.StartNew(); captures a new Stopwatch for each RunAllTasks call. Nothing is actually "global" anymore.
If I execute that with RunAllTasks(7), it runs and it counts correctly.
Now you can refactor your PerformRequest method to do just what it needs to:
public async Task<string> PerformRequest(RequestPayload requestPayload)
{
var url = "myUrl.com";
var client = new RestClient(url) { Timeout = -1 };
var request = new RestRequest { Method = Method.POST };
request.AddHeaders(requestPayload.Headers);
foreach (var cookie in requestPayload.Cookies)
{
request.AddCookie(cookie.Key, cookie.Value);
}
request.AddJsonBody(requestPayload.BodyRequest);
var response = await client.ExecuteAsync(request);
return response.Content;
}
Running the tests is easy:
public async Task<RequestResult<string>[]> RunAllTasks(int numberOfRequests)
{
var sw = Stopwatch.StartNew();
var tasks =
from i in Enumerable.Range(0, numberOfRequests)
select ExecuteAsync<string>(sw, () => _requestService.PerformRequest(requestPayload));
return await Task.WhenAll(tasks).ConfigureAwait(false);
}
If there's any doubt about the thread-safety of Stopwatch then you could do this:
public async Task<RequestResult<R>> ExecuteAsync<R>(Func<long> getMilliseconds, Func<Task<R>> process)
{
var s = getMilliseconds();
var c = await process();
var d = getMilliseconds() - s;
return new RequestResult<R>()
{
Content = c,
Millisecond = s,
Duration = d
};
}
public async Task<RequestResult<int>[]> RunAllTasks(int numberOfRequests)
{
var sw = Stopwatch.StartNew();
var tasks =
from i in Enumerable.Range(0, numberOfRequests)
select ExecuteAsync<int>(() => { lock (sw) { return sw.ElapsedMilliseconds; } }, () => DummyAsync(i));
return await Task.WhenAll(tasks).ConfigureAwait(false);
}
I am trying to make HttpClient requests inside a loop as follows. It needs to make multiple consecutive calls to a third-party REST API.
But it only gives me the first service call's result; the loop exits before getting the results from the rest of the calls.
private void Search()
{
try
{
var i = 1;
using (var httpClient = new HttpClient())
{
while (i < 5)
{
string url = "https://jsonplaceholder.typicode.com/posts/" + i;
var response = httpClient.GetAsync(url).Result;
string jsonResult = response.Content.ReadAsStringAsync().Result;
Console.WriteLine(jsonResult.ToString());
i++;
}
}
}
catch (Exception ex)
{
Console.WriteLine(ex.ToString());
}
}
When I run it with breakpoints, the program gives me all the results. But when I run it without breakpoints, it gives me only the first result.
I tried this using async/await methods too. It gives me the same result.
I feel the program needs to wait until the async call returns data.
Please help me to solve this.
EDIT - async way
private async Task<string> SearchNew()
{
try
{
var i = 1;
var res = string.Empty;
using (var httpClient = new HttpClient())
{
while (i < 5)
{
string url = "https://jsonplaceholder.typicode.com/posts/" + i;
var response = httpClient.GetAsync(url).Result;
string jsonResult = await response.Content.ReadAsStringAsync();
res = res + jsonResult + " --- ";
i++;
}
}
return res;
}
catch (Exception ex)
{
return ex.Message;
}
}
Both give the same result.
There are a few things here that you should be doing. First, move the HttpClient creation outside of your method and make it static. You only need one of them, and having multiple can be really bad for stability (see here):
private static HttpClient _client = new HttpClient();
Next, extract the calls to the HttpClient into a single method, something simple like this:
//Please choose a better name than this
private async Task<string> GetData(string url)
{
var response = await _client.GetAsync(url);
return await response.Content.ReadAsStringAsync();
}
And finally, you create a list of tasks and wait for them all to complete asynchronously using Task.WhenAll:
private async Task<string[]> SearchAsync()
{
var i = 1;
var tasks = new List<Task<string>>();
//Create the tasks
while (i < 5)
{
string url = "https://jsonplaceholder.typicode.com/posts/" + i;
tasks.Add(GetData(url));
i++;
}
//Wait for the tasks to complete and return
return await Task.WhenAll(tasks);
}
And to call this method:
var results = await SearchAsync();
foreach (var result in results)
{
Console.WriteLine(result);
}
I would like to process a list of 50,000 URLs through a web service. The provider of this service allows 5 connections per second.
I need to process these URLs in parallel while adhering to the provider's rules.
This is my current code:
static void Main(string[] args)
{
process_urls().GetAwaiter().GetResult();
}
public static async Task process_urls()
{
// let's say there is a list of 50,000+ URLs
var urls = System.IO.File.ReadAllLines("urls.txt");
var allTasks = new List<Task>();
var throttler = new SemaphoreSlim(initialCount: 5);
foreach (var url in urls)
{
await throttler.WaitAsync();
allTasks.Add(
Task.Run(async () =>
{
try
{
Console.WriteLine(String.Format("Starting {0}", url));
var client = new HttpClient();
var xml = await client.GetStringAsync(url);
//do some processing on xml output
client.Dispose();
}
finally
{
throttler.Release();
}
}));
}
await Task.WhenAll(allTasks);
}
Instead of var client = new HttpClient(); I will create a new object for the target web service, but this is just to keep the code generic.
Is this the correct approach to handle and process a huge list of connections? And is there any way I can limit the number of established connections per second to 5, since the current implementation does not consider any timeframe?
Thanks
Reading values from a web service is an I/O operation, which can be done asynchronously without multithreading.
The threads do nothing but wait for a response in this case, so running this in parallel on multiple threads is just a waste of resources.
public static async Task process_urls()
{
var urls = System.IO.File.ReadAllLines("urls.txt");
var allTasks = new List<Task>();
foreach (var urlGroup in SplitToGroupsOfFive(urls))
{
var tasks = new List<Task>();
foreach(var url in urlGroup)
{
var task = ProcessUrl(url);
tasks.Add(task);
}
// This delay ensures that the next 5 URLs are started only after 1 second
tasks.Add(Task.Delay(1000));
await Task.WhenAll(tasks.ToArray());
}
}
private static async Task ProcessUrl(string url)
{
using (var client = new HttpClient())
{
var xml = await client.GetStringAsync(url);
//do some processing on xml output
}
}
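As the earlier answer about HttpClient noted, a single shared instance is preferable to creating one per URL; a minimal sketch of that variant:
private static readonly HttpClient _client = new HttpClient();
private static async Task ProcessUrl(string url)
{
    var xml = await _client.GetStringAsync(url);
    //do some processing on xml output
}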
private static IEnumerable<IEnumerable<string>> SplitToGroupsOfFive(IEnumerable<string> urls)
{
const int GROUP_SIZE = 5;
string[] group = null;
int count = 0;
foreach (var url in urls)
{
if (group == null)
group = new string[GROUP_SIZE];
group[count] = url;
count++;
if (count < GROUP_SIZE)
continue;
yield return group;
group = null;
count = 0;
}
if (group != null && count > 0)
{
yield return group.Take(count);
}
}
Because you mention that "processing" the response is also an I/O operation, the async/await approach is the most efficient: it uses only one thread and works on other tasks while previous tasks are waiting for a response from the web service or for file-writing I/O.
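One caveat with the grouping approach: a slow request holds up the entire next group, so the real rate can drop well below 5 per second. An alternative sketch that reuses the question's SemaphoreSlim, releasing each slot one second after a request starts rather than when it finishes (whether the provider counts "connections per second" this way is an assumption):
public static async Task process_urls()
{
    var urls = System.IO.File.ReadAllLines("urls.txt");
    var throttler = new SemaphoreSlim(initialCount: 5);
    var allTasks = urls.Select(async url =>
    {
        await throttler.WaitAsync();
        // Free the slot one second after the request starts, so up to
        // 5 new requests can begin in any given second.
        _ = Task.Delay(1000).ContinueWith(t => throttler.Release());
        await ProcessUrl(url);
    });
    await Task.WhenAll(allTasks);
}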