Parallelize C# Graph API SDK methods

I'm connecting to and fetching transitive groups data from the MS Graph API via the following logic:
var queryOptions = new List<QueryOption>()
{
    new QueryOption("$count", "true")
};
var lstTemp = graphClient.Groups[$"{groupID}"].TransitiveMembers
    .Request(queryOptions)
    .Header("ConsistencyLevel", "eventual")
    .Select("id,mail,onPremisesSecurityIdentifier").Top(999)
    .GetAsync().GetAwaiter().GetResult();
var lstGroups = lstTemp.CurrentPage.Where(x => x.ODataType.Contains("group")).ToList();
while (lstTemp.NextPageRequest != null)
{
    lstTemp = lstTemp.NextPageRequest.GetAsync().GetAwaiter().GetResult();
    lstGroups.AddRange(lstTemp.CurrentPage.Where(x => x.ODataType.Contains("group")).ToList());
}
Although the above logic works fine, for larger data sets where the result count is around 10K records or more, I've noticed that fetching all of the results takes around 10-12 seconds.
I'm looking for a way to parallelize (or multi-thread/multi-task) the API calls so that the overall time to get the complete results is further reduced.
In C# we have Parallel.For etc. Can I use it in this scenario to replace my regular while loop shown above?
Any suggestions?

Not with the Parallel.For API as such, but you can execute a bunch of asynchronous tasks concurrently by collecting them into a List<Task<T>> and awaiting the whole list with Task.WhenAll. Your code may look something like this:
var queryOptions = new List<QueryOption>()
{
    new QueryOption("$count", "true")
};
// Creating the first request
var firstRequest = graphClient.Groups[$"{groupID}"].TransitiveMembers
    .Request(queryOptions)
    .Header("ConsistencyLevel", "eventual")
    .Select("id,mail,onPremisesSecurityIdentifier").Top(999)
    .GetAsync();
// Creating a list of all requests (starting with the first one)
var requests = new List<Task<IGroupTransitiveMembersCollectionWithReferencesPage>>() { firstRequest };
// Awaiting the first response
var firstResponse = await firstRequest;
// Getting the total count from the "@odata.count" annotation; the boxed type
// varies by SDK version, so parse via string to be safe
var count = int.Parse(firstResponse.AdditionalData["@odata.count"].ToString());
// Setting the offset to the amount of data you already pulled
var offset = 999;
while (offset < count)
{
    // Creating the next request
    var nextRequest = graphClient.Groups[$"{groupID}"].TransitiveMembers
        .Request() // Notice no $count=true (may potentially hurt performance and we don't need it anymore anyway)
        .Header("ConsistencyLevel", "eventual")
        .Select("id,mail,onPremisesSecurityIdentifier")
        .Skip(offset).Top(999) // Skipping the data you already pulled
        .GetAsync();
    // Adding it to the list
    requests.Add(nextRequest);
    // Increasing the offset
    offset += 999;
}
// Waiting for all the requests to finish
var allResponses = await Task.WhenAll(requests);
// This flattens the pages while filtering as you did
var lstGroups = allResponses
    .SelectMany(page => page.CurrentPage.Where(member => member.ODataType.Contains("group")))
    .ToList();
I couldn't check whether this code works without a Graph tenant, so you might need to modify it a bit, but I hope you can see the general idea.
I also took the liberty of refactoring the code to use proper async/await, since that's good and standard practice, but it should work with .GetAwaiter().GetResult() if you can't use await in your context for some reason (please consider switching, though).
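One more caveat: for very large groups this fires many page requests at once, which can run into Microsoft Graph's throttling (HTTP 429 responses). A minimal sketch of capping the number of in-flight requests with SemaphoreSlim; RunThrottledAsync and maxConcurrency are illustrative names, not Graph SDK APIs:
// Illustrative helper: runs the given request factories with at most
// maxConcurrency requests in flight at any moment.
static async Task<TResult[]> RunThrottledAsync<TResult>(
    IEnumerable<Func<Task<TResult>>> requestFactories, int maxConcurrency)
{
    using var throttler = new SemaphoreSlim(maxConcurrency);
    var tasks = requestFactories.Select(async factory =>
    {
        await throttler.WaitAsync();     // wait for a free slot
        try { return await factory(); }  // start the request only once a slot is free
        finally { throttler.Release(); } // hand the slot to the next request
    }).ToList();                         // materialize so every task is started
    return await Task.WhenAll(tasks);
}
You'd then build the page requests as factories (() => pageRequest.GetAsync()) instead of starting them immediately.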

Related

Client-side request rate-limiting

I'm designing a .NET client application for an external API. It's going to have two main responsibilities:
Synchronization - making a batch of requests to API and saving responses to my database periodically.
Client - a pass-through for requests to API from users of my client.
The service's documentation specifies the following rules on the maximum number of requests that can be issued in a given period of time:
During a day:
Maximum of 6000 requests per hour (~1.67 per second)
Maximum of 120 requests per minute (2 per second)
Maximum of 3 requests per second
At night:
Maximum of 8000 requests per hour (~2.22 per second)
Maximum of 150 requests per minute (2.5 per second)
Maximum of 3 requests per second
Exceeding these limits won't result in immediate lockdown - no exception will be thrown. But the provider can get annoyed, contact us, and then ban us from using the service. So I need some request delaying mechanism in place to prevent that. Here's how I see it:
public async Task MyMethod(Request request)
{
    await _rateLimiter.WaitForNextRequest(); // awaitable Task with calculated delay
    await _api.DoAsync(request);
    _rateLimiter.AppendRequestCounters();
}
The safest and simplest option would be to respect the lowest rate limit only, that is, a max of 3 requests per 2 seconds. But because of the "Synchronization" responsibility, there is a need to use as much of these limits as possible.
So the next option would be to add a delay based on the current request count. I've tried to do something on my own and I've also used RateLimiter by David Desmaisons, which would've been fine, but here's the problem:
Assuming my client sends 3 requests per second to the API during the day, we're going to see:
A 20 second delay every 120th request
A ~15 minute delay every 6000th request
This would've been acceptable if my application was only about "Synchronization", but "Client" requests can't wait that long.
I've searched the Web, and I've read about token/leaky bucket and sliding window algorithms, but I couldn't translate them to my case and .NET, since they mainly cover the rejecting of requests that exceed a limit. I've found this repo and that repo, but they are both only service-side solutions.
QoS-like splitting of rates, so that "Synchronization" would get the slower rate and "Client" the faster one, is not an option.
Assuming that current request rates will be measured, how do I calculate the delay for the next request so that it adapts to the current situation, respects all maximum rates, and is never longer than 5 seconds? Something like gradually slowing down when approaching a limit.
This is achievable by using the Library you linked on GitHub. We need to use a composed TimeLimiter made out of 3 CountByIntervalAwaitableConstraint like so:
var hourConstraint = new CountByIntervalAwaitableConstraint(6000, TimeSpan.FromHours(1));
var minuteConstraint = new CountByIntervalAwaitableConstraint(120, TimeSpan.FromMinutes(1));
var secondConstraint = new CountByIntervalAwaitableConstraint(3, TimeSpan.FromSeconds(1));
var timeLimiter = TimeLimiter.Compose(hourConstraint, minuteConstraint, secondConstraint);
We can test to see if this works by doing this:
for (int i = 0; i < 1000; i++)
{
    await timeLimiter;
    Console.WriteLine($"Iteration {i} at {DateTime.Now:T}");
}
This will run 3 times every second until we reach 120 iterations (iteration 119), then wait until the minute is over, and then continue running 3 times every second. We can also (again using the library) easily use the TimeLimiter with an HttpClient via the AsDelegatingHandler() extension method provided, like so:
var handler = TimeLimiter.Compose(hourConstraint, minuteConstraint, secondConstraint).AsDelegatingHandler();
var client = new HttpClient(handler);
We can also use CancellationTokens, though as far as I can tell not at the same time as using it as the handler for the HttpClient. Here is how you can use it with an HttpClient anyway:
var timeLimiter = TimeLimiter.Compose(hourConstraint, minuteConstraint, secondConstraint);
var client = new HttpClient();
for (int i = 0; i < 100; i++)
{
    await timeLimiter.Enqueue(async () =>
    {
        var response = await client.GetAsync("https://hacker-news.firebaseio.com/v0/item/8863.json?print=pretty");
        if (response.IsSuccessStatusCode)
            Console.WriteLine(await response.Content.ReadAsStringAsync());
        else
            Console.WriteLine($"Error code {response.StatusCode} reason: {response.ReasonPhrase}");
    }, new CancellationTokenSource(TimeSpan.FromSeconds(10)).Token);
}
Edit to address the OP's question further:
If you want to make sure a user can send a request without having to wait for the limit to reset, we need to dedicate a certain number of requests every second/minute/hour to our user. So we need a new TimeLimiter for this and we also need to adjust our API TimeLimiter. Here are the two new ones:
var apiHourConstraint = new CountByIntervalAwaitableConstraint(5500, TimeSpan.FromHours(1));
var apiMinuteConstraint = new CountByIntervalAwaitableConstraint(100, TimeSpan.FromMinutes(1));
var apiSecondConstraint = new CountByIntervalAwaitableConstraint(2, TimeSpan.FromSeconds(1));
// TimeLimiter for calls made automatically to the API
var apiTimeLimiter = TimeLimiter.Compose(apiHourConstraint, apiMinuteConstraint, apiSecondConstraint);
var userHourConstraint = new CountByIntervalAwaitableConstraint(500, TimeSpan.FromHours(1));
var userMinuteConstraint = new CountByIntervalAwaitableConstraint(20, TimeSpan.FromMinutes(1));
var userSecondConstraint = new CountByIntervalAwaitableConstraint(1, TimeSpan.FromSeconds(1));
// TimeLimiter for calls made manually by a user to the API
var userTimeLimiter = TimeLimiter.Compose(userHourConstraint, userMinuteConstraint, userSecondConstraint);
You can play around with the numbers to suit your need.
Now to use it:
I saw you're using a central method to execute your requests, which makes this easier. I'll just add an optional boolean parameter that determines whether it's an automatically executed request or one made by a user. (You could replace this parameter with an enum if you want more than just automatic and manual requests.)
public static async Task DoRequest(Request request, bool manual = false)
{
    TimeLimiter limiter;
    if (manual)
        limiter = TimeLimiterManager.UserLimiter;
    else
        limiter = TimeLimiterManager.ApiLimiter;
    await limiter;
    await _api.DoAsync(request);
}
static class TimeLimiterManager
{
    public static TimeLimiter ApiLimiter { get; }
    public static TimeLimiter UserLimiter { get; }

    static TimeLimiterManager()
    {
        var apiHourConstraint = new CountByIntervalAwaitableConstraint(5500, TimeSpan.FromHours(1));
        var apiMinuteConstraint = new CountByIntervalAwaitableConstraint(100, TimeSpan.FromMinutes(1));
        var apiSecondConstraint = new CountByIntervalAwaitableConstraint(2, TimeSpan.FromSeconds(1));
        // TimeLimiter to control access to the API for automatically executed requests
        ApiLimiter = TimeLimiter.Compose(apiHourConstraint, apiMinuteConstraint, apiSecondConstraint);

        var userHourConstraint = new CountByIntervalAwaitableConstraint(500, TimeSpan.FromHours(1));
        var userMinuteConstraint = new CountByIntervalAwaitableConstraint(20, TimeSpan.FromMinutes(1));
        var userSecondConstraint = new CountByIntervalAwaitableConstraint(1, TimeSpan.FromSeconds(1));
        // TimeLimiter to control access to the API for manually executed requests
        UserLimiter = TimeLimiter.Compose(userHourConstraint, userMinuteConstraint, userSecondConstraint);
    }
}
This isn't perfect: when the user doesn't use their 20 API calls in a given minute but your automated system needs to execute more than 100 in that minute, it will still have to wait.
And regarding day/night differences: you can use 2 backing fields for the ApiLimiter/UserLimiter and return the appropriate one in the { get { ... } } of each property.
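A rough sketch of that last idea, assuming night runs from 22:00 to 06:00 (that window and the night-time numbers are placeholders; adjust them to the provider's actual schedule):
static class TimeLimiterManager
{
    // Two backing fields, one per schedule; the night numbers carve the same
    // user share (500/20/1) out of the 8000/150/3 night limits
    private static readonly TimeLimiter _dayApiLimiter = TimeLimiter.Compose(
        new CountByIntervalAwaitableConstraint(5500, TimeSpan.FromHours(1)),
        new CountByIntervalAwaitableConstraint(100, TimeSpan.FromMinutes(1)),
        new CountByIntervalAwaitableConstraint(2, TimeSpan.FromSeconds(1)));

    private static readonly TimeLimiter _nightApiLimiter = TimeLimiter.Compose(
        new CountByIntervalAwaitableConstraint(7500, TimeSpan.FromHours(1)),
        new CountByIntervalAwaitableConstraint(130, TimeSpan.FromMinutes(1)),
        new CountByIntervalAwaitableConstraint(2, TimeSpan.FromSeconds(1)));

    public static TimeLimiter ApiLimiter
    {
        get
        {
            var hour = DateTime.Now.Hour;
            var isNight = hour >= 22 || hour < 6; // placeholder night window
            return isNight ? _nightApiLimiter : _dayApiLimiter;
        }
    }
}
Note that the two limiters count requests independently, so right around the day/night boundary the combined rate can briefly exceed one schedule; if that matters, keep using the stricter day limiter for the first minute of the night window.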

Threading in C# while making PUT calls

I am new to the threading world of C#. I read there are different ways to do threading, like sequential.
My scenario is below. Which one would be more suitable for it?
I have a list of complex objects. I will be making calls to a PUT endpoint for each object [body of PUT] separately. There can be 1000 or more objects in the list, and I cannot pass all the objects at once, so I have to pass each object in a separate call to the PUT endpoint. That way, I have to make 1000 separate calls if there are 1000 objects.
Each PUT call is independent of the others, but I have to store the properties of the response from each call.
I was thinking of applying a threading concept to the above, but I'm not sure which one and how to do it.
Any suggestions would be greatly appreciated.
Thanks in advance.
As per the comments below, I'm putting the method signatures here and adding more details.
I have an IEnumerable<CamelList>. For each camel, I have to make a PUT request call and update the table from the response of each call. I will write a new method that will accept this list and make use of the 2 methods below to make the calls and update the table. I have to ensure I am making no more than 100 calls at the same time, and the API I am calling can be called by the same user 100 times per minute.
We have a method:
public Camel SendRequest(handler, uri, route, Camel); // basically takes all the parameters and provides you the Camel
We have a method:
public void updateInTable(Entity camel); // updates the table
HTTP calls are typically made using the HttpClient class, whose HTTP methods are already asynchronous. You don't need to create your own threads or tasks.
All asynchronous methods return a Task or Task<T> value. You need to use the await keyword to wait for the operation to complete asynchronously - that means the thread is released until the operation completes. When that happens, execution resumes after the await.
You can see how to write a PUT request here. The example uses the PutAsJsonAsync method to reduce the boilerplate code needed to serialize a Product class into a string and create a StringContent class with the correct content type, eg:
var response = await client.PutAsJsonAsync($"api/products/{product.Id}", product);
response.EnsureSuccessStatusCode();
If you want to PUT 1000 products, all you need is an array or list with the products. You can use LINQ to make multiple calls and await the tasks they return at the end:
var callTasks = myProducts.Select(product => client.PutAsJsonAsync($"api/products/{product.Id}", product));
var responses = await Task.WhenAll(callTasks);
This means that you have to wait for all requests to finish before you can check whether any one succeeded. You can change the body of Select to await the response itself:
var callTasks = myProducts.Select(async product =>
{
    var response = await client.PutAsJsonAsync($"api/products/{product.Id}", product);
    if (!response.IsSuccessStatusCode)
    {
        //Log the error
    }
    return response.StatusCode;
});
var responses = await Task.WhenAll(callTasks);
It's better to convert the lambda into a separate method though, eg PutProductAsync:
async Task<HttpStatusCode> PutProductAsync(Product product, HttpClient client)
{
    var response = await client.PutAsJsonAsync($"api/products/{product.Id}", product);
    if (!response.IsSuccessStatusCode)
    {
        //Log the error
    }
    return response.StatusCode;
}

var callTasks = myProducts.Select(product => PutProductAsync(product, client));
var responses = await Task.WhenAll(callTasks);
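The question also mentions two limits (no more than 100 calls at the same time, and 100 calls per user per minute) that the snippets above don't enforce. A rough sketch under those assumptions, reusing the SendRequest/updateInTable signatures from the question; camels, handler, uri and route come from the surrounding context, and Chunk needs .NET 6+ (otherwise split the list manually):
var results = new List<Camel>();
foreach (var batch in camels.Chunk(100)) // at most 100 calls in flight per batch
{
    var stopwatch = System.Diagnostics.Stopwatch.StartNew();
    // SendRequest is synchronous per the question, so wrap each call in Task.Run
    var batchResults = await Task.WhenAll(
        batch.Select(camel => Task.Run(() => SendRequest(handler, uri, route, camel))));
    results.AddRange(batchResults);

    // Pad each batch out to a full minute to stay under 100 calls per minute
    var remaining = TimeSpan.FromMinutes(1) - stopwatch.Elapsed;
    if (remaining > TimeSpan.Zero)
        await Task.Delay(remaining);
}
foreach (var camel in results)
    updateInTable(camel);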
I'm going to suggest using Microsoft's Reactive Framework for this. You need to NuGet "System.Reactive" to get the bits.
Then you can do this:
var urls = new string[1000]; // somehow populated

Func<string, HttpContent, IObservable<string>> putCall = (u, c) =>
    Observable
        .Using(
            () => new HttpClient(),
            hc =>
                from resp in Observable.FromAsync(() => hc.PutAsync(u, c))
                from body in Observable.FromAsync(() => resp.Content.ReadAsStringAsync())
                select body);

var callsPerTimeSpanAllowed = 100;
var timeSpanAllowed = TimeSpan.FromMinutes(1.0);

IObservable<IList<string>> bufferedIntervaledUrls =
    Observable.Zip(
        Observable.Interval(timeSpanAllowed),
        urls.ToObservable().Buffer(callsPerTimeSpanAllowed),
        (_, buffered_urls) => buffered_urls);

var query =
    from bufferedUrls in bufferedIntervaledUrls
    from url in bufferedUrls
    from result in putCall(url, new StringContent("YOURCONTENTHERE"))
    select new { url, result };

IDisposable subscription =
    query
        .Subscribe(
            x => { /* do something with each `x.url` & `x.result` */ },
            () => { /* do something when it is all finished */ });
This code breaks the URLs into blocks (or buffers) of 100 and puts them on a timeline (or interval) 1 minute apart. It then calls putCall for each URL and returns the result.
It's probably a little advanced for you now, but I thought this answer might be useful just to see how clean this can be.

Get Geolocation information from external API using Parallel programming

I am using the postcodes.io API to get geolocation information based on postcodes. It's working fine, but this API has a limitation: it can accept only 100 postcodes per request. I have around 300 postcodes, so I am thinking of making calls to the API in parallel and aggregating the responses afterwards.
var httpClient = _httpClientFactory.GetHttpClient("GeolocationAPI", postCodeServiceUrl, _httpLogger);
// chunks the list by 100 items and returns the collection
var postcodeChunks = postcodes.ChunkBy(100);
postcodeChunks.ForEach(
    postcodeList =>
    {
        var response = httpClient
            .Post<MultipleGeolocationInfoRequest,
                GetGeolocationResult<GetGeolocationResult<GeolocationInfo>[]>>(
                "postcodes",
                new MultipleGeolocationInfoRequest { Postcodes = postcodeList.ToArray() });
    }
);
The ChunkBy extension method returns a list of lists, each containing 100 postcodes.
I am facing difficulty in aggregating the returned responses and handling exceptions, if any. The response from each call is also a collection.
API: http://postcodes.io/
To aggregate results you can collect them into a collection of results first and then SelectMany:
List<Postcode[]> responses = new List<Postcode[]>();
postcodeChunks.ForEach(
    postcodeList =>
    {
        try // handle exception for individual call
        {
            var response = httpClient.Post(....);
            responses.Add(response);
        }
        catch (Exception ex)
        {
            // do something sensible with the exception for this request
            // continue or abort (with throw)
        }
    }
);
// merge all results into a single list
var aggregated = responses.SelectMany(x => x).ToList();
Note that your code is not parallel at all - consider using Parallel.ForEach or async-based code with Task.WhenAll to run all requests in parallel.
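For completeness, here's a sketch of what the Task.WhenAll variant could look like. It assumes the custom client exposes (or can be wrapped in) an async PostAsync counterpart to the Post call above, and the response type is shortened to GeolocationInfo[] for readability:
var tasks = postcodeChunks.Select(async postcodeList =>
{
    try
    {
        return await httpClient.PostAsync<MultipleGeolocationInfoRequest, GeolocationInfo[]>(
            "postcodes",
            new MultipleGeolocationInfoRequest { Postcodes = postcodeList.ToArray() });
    }
    catch (Exception ex)
    {
        // log ex; return an empty chunk so one failure doesn't sink the rest
        return Array.Empty<GeolocationInfo>();
    }
}).ToList();

var responses = await Task.WhenAll(tasks); // all chunks run concurrently
var aggregated = responses.SelectMany(x => x).ToList();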

Find Result of Parallel Async Tasks

Based off this question, I'm trying to set up code to save several images to Azure Blob Storage in parallel. The method below works fine, and awaiting Task.WhenAll(tasks) waits for all tasks to complete before continuing.
The only trouble is, I would like to be able to find out whether each request to store the information in our database actually succeeded. _db.AddImageAsync returns a bool, and the code below waits for all tasks to complete, but when I check the result of the tasks each one is false (even if I actually returned true inside the brackets).
Each task in the Enumerable says the result has not yet been computed even though I stepped through with breakpoints and each has been carried out.
var tasks = wantedSizes.Select(async (wantedSize, index) =>
{
    var resize = size.CalculateResize(wantedSize.GetMaxSize());
    var quality = wantedSize.GetQuality();
    using (var output = ImageProcessHelper.Process(streams[index], resize, quality))
    {
        var path = await AzureBlobHelper.SaveFileAsync(output, FileType.Image);
        var result = await _db.AddImageAsync(id, wantedSize, imageNumber, path);
        return result;
    }
});

await Task.WhenAll(tasks);
if (!tasks.All(task => task.Result))
    return new ApiResponse(ResponseStatus.Fail);
Any help is much appreciated!
Because .Select( is lazily evaluated and returns an IEnumerable<Task<bool>>, you cause the .Select( to run again each time you iterate over the result. Throw a .ToList() on it to make it a List<Task<bool>>: that executes the .Select( once, and the multiple enumerations will then be over the returned List<Task<bool>>, which has no side effects.
var tasks = wantedSizes.Select(async (wantedSize, index) =>
{
    var resize = size.CalculateResize(wantedSize.GetMaxSize());
    var quality = wantedSize.GetQuality();
    using (var output = ImageProcessHelper.Process(streams[index], resize, quality))
    {
        var path = await AzureBlobHelper.SaveFileAsync(output, FileType.Image);
        // Double check your documentation: is _db.AddImageAsync thread safe?
        var result = await _db.AddImageAsync(id, wantedSize, imageNumber, path);
        return result;
    }
}).ToList(); // We run the Select once here to produce the List.

await Task.WhenAll(tasks); // This is the first enumeration of the variable "tasks".
if (!tasks.All(task => task.Result)) // This is a 2nd enumeration of the variable.
    return new ApiResponse(ResponseStatus.Fail);
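As a side note, since these are Task<bool> items, await Task.WhenAll(tasks) already hands back all the results as a bool[] in the original task order, which lets you skip touching task.Result entirely:
var results = await Task.WhenAll(tasks); // bool[] with one entry per task
if (!results.All(r => r))
    return new ApiResponse(ResponseStatus.Fail);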

How to do async 'paged' processing of EF select result

I'm writing something that loads records from SQL Server onto an Azure queue. The thing is, the number of items in the select result might be very large, so I would like to start the queuing work while data is still being retrieved.
I'm trying to leverage EF6 (async), all-async methods, and the TPL for parallel enqueueing. So I have:
// This defines the queue that Generator will publish to and
// QueueManager will read from. More info:
// http://msdn.microsoft.com/en-us/library/hh228601(v=vs.110).aspx
var queue = new BufferBlock<ProcessQueueItem>();
// Configure queue listener first
var result = this.ReceiveAndEnqueue(queue);
// Start generation process
generator.Generate(batchId, queue);
The ReceiveAndEnqueue is simple:
private async Task ReceiveAndEnqueue(ISourceBlock<ProcessQueueItem> queue)
{
    while (await queue.OutputAvailableAsync())
    {
        var processQueueItem = await queue.ReceiveAsync();
        await this.queueManager.Enqueue(processQueueItem);
        this.tasksEnqueued++;
    }
}
The generator's Generate() signature is as follows:
public void Generate(Guid someId, ITargetBlock<ProcessQueueItem> target)
which calls the SendAsync() method on the target to place new items. What I'm doing right now is dividing the total number of results into 'batches', loading them in, and sending them asynchronously until all is done:
public void Generate(Guid batchId, ITargetBlock<ProcessQueueItem> target)
{
    var accountPromise = this.AccountStatusRepository.GetAccountsByBatchId(batchId.ToString());
    accountPromise.Wait();
    var accounts = accountPromise.Result;

    // Batch configuration
    var itemCount = accounts.Count();
    var numBatches = (int)Math.Ceiling((double)itemCount / this.batchSize);
    var currentIndex = 0;
    Debug.WriteLine("Found {0} items that will be put in {1} batches of {2}", itemCount, numBatches, this.batchSize);

    for (int i = 0; i < numBatches; i++)
    {
        var itemsToTake = Math.Min(this.batchSize, itemCount - currentIndex);
        Debug.WriteLine("Running batch - skip {0} and take {1}", currentIndex, itemsToTake);

        // Take a subset of the items and place them onto the queue
        var batch = accounts.Skip(currentIndex).Take(itemsToTake);

        // Generate a list of tasks to enqueue the items
        var taskList = new List<Task>(itemsToTake);
        taskList.AddRange(batch.Select(account => target.SendAsync(account.AsProcessQueueItem(batchId))));

        // Return control when all tasks have been enqueued
        Task.WaitAll(taskList.ToArray());
        currentIndex = currentIndex + this.batchSize;
    }
}
This works. However, my colleague remarked: 'Can't we make the interface simpler and let Generate() look like this:'
public Task<IEnumerable<ProcessQueueItem>> Generate(Guid someId)
A lot cleaner, and no dependency of the Generate method on the TPL library. I totally agree; I'm just afraid that if I do that, I'm going to have to call
var result = Generate(someId).Result;
at some point, before enqueueing all the items. That will make me wait until ALL of the data is loaded in and in memory.
So what my question comes down to is: how can I start using EF query results as soon as they 'drip in' from a select? As if EF would run a 'yield' over the results, if you catch my drift.
EDIT
I think I made a thinking mistake. EF loads items lazily by default, so I can just return all the results as IQueryable<>, but that doesn't mean they're actually loaded from the DB. I'll then iterate over them and enqueue them.
EDIT 2
Nope, that doesn't work, since I need to transform the objects from the database in the Generate() method...
OK, this is what I ended up with:
public IEnumerable<ProcessQueueItem> Generate(Guid batchId)
{
var accounts = this.AccountStatusRepository.GetAccountsByBatchId(batchId.ToString());
foreach (var accountStatuse in accounts)
{
yield return accountStatuse.AsProcessQueueItem(batchId);
}
}
The repository returns an IEnumerable of just some DataContext.Stuff.Where(...). The generator uses the extension method to transform each entity to the domain model (ProcessQueueItem), which, by means of yield, is immediately handed to the caller of the method, which starts calling the QueueManager to begin queueing.
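To tie this back to the dataflow setup at the top of the question, the consuming side could then look roughly like this (a sketch reusing the queue, generator, and batchId names from above):
// Items are pushed to the BufferBlock as soon as the iterator yields them
foreach (var item in generator.Generate(batchId))
{
    await queue.SendAsync(item);
}
// Completing the block makes OutputAvailableAsync() return false once drained,
// so ReceiveAndEnqueue can finish
queue.Complete();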
