C# batch processing of async web responses hangs just before finishing

C# batch processing of async web responses hangs just before finishing - c#

Here is the scenario.
I want to call 2 versions of an API (hosted on different servers), then cast their responses (they come as a JSON) to C# objects and compare them.
An important note here is that i need to query the APIs a lot of times ~3000. The reason for this is that I query an endpoint that has an id and that returns a specific object from the DB. So my queries are like http://myapi/v1/endpoint/id. And I basically use a loop to go through all of the ids.
Here is the issue
I start querying the API and for the first 90% of all requests it is blazing fast (I get the response and i process it) and all that happens under 5 seconds.
Then however, I start to come to a stop. The next 50-100 requests can take between 1 - 5 seconds to process and after that I come to a stop. No CPU-usage, network activity is low (and I am pretty sure that activity is from other apps). And my app just hangs.
UPDATE: Around 50% of the times I tested this, it does finally resume after quite a bit of time. But the other 50% it still just hangs
Here is what I am doing in-code
I have a list of Ids that I iterate to query the endpoint.
This is the main piece of code that queries the APIs and processes the responses.
var endPointIds = await GetIds(); // this queries a different endpoint to get all ids, however there are no issues with it
var tasks = endPointIds.Select(async id =>
{
var response1 = await _data.GetData($"{Consts.ApiEndpoint1}/{id}");
var response2 = await _data.GetData($"{Consts.ApiEndpoint2}/{id}");
return ProcessResponces(response1, response2);
});
var res = await Task.WhenAll(tasks);
var result = res.Where(r => r != null).ToList();
return result; // I never get to return the result, the app hangs before this is reached
This is the GetData() method
private async Task<string> GetAsync(string serviceUri)
{
try
{
var request = WebRequest.CreateHttp(serviceUri);
request.ContentType = "application/json";
request.Method = WebRequestMethods.Http.Get;
using (var response = await request.GetResponseAsync())
using (var responseStream = response.GetResponseStream())
using (var streamReader = new StreamReader(responseStream, Encoding.UTF8))
{
return await streamReader.ReadToEndAsync();
}
}
catch
{
return string.Empty;
}
}
I would link the ProcessResponces method as well, however I tried mocking it to return a string like so:
private string ProcessResponces(string responseJson1, string responseJson1)
{
//usually i would have 2 lines that deserialize responseJson1 and responseJson1 here using Newtonsoft.Json's DeserializeObject<>
return "Fake success";
}
And even with this implementation my issue did not go away (only difference it made is that I managed the have fast requests for like 97% of my requests, but my code still ended up stopping at the last few request), so I am guessing the main issue is not related to that method. But what it more or less does is deserialize both responses to c# objects, compares them and returns information about their equality.
Here are my observations after 4 hours of debugging
If I manually reduce the number of queries to my API (I used .Take() method on the list of ids) the issue still persists. For example on 1000 total requests I start hanging around 900th, for 1500 on the 1400th an so on. I believe the issue goes away at around 100-200 requests, but I am not sure since it might just be too fast for me to notice.
Since this is currently a console app I tried adding WriteLines() in some of my methods, and the issue seemed to go away (I am guessing the delay in speed that writing on the console creates, gives some time between requests and that helps)
Lastly i did a concurrency profiling of my app and it reported that there were a lot of contentions happening at the point where my app hangs. Opening the contention tab showed that they are mainly happening with System.IO.StreamReader.ReadToEndAsync()
Thoughts and Questions
Obviously, what can I do to resolve the issue?
Is my GetAsync() method wrong, should I be using something else instead of responseStream and streamReader?
I am not super proficient in asynchronous operations, maybe my flow of async/await operations is wrong.
Lastly, could it be something with the API controllers themselves? They are standard ASP.NET MVC 5 WebAPI controllers (version 5.2.3.0)

After long hours of tracking my requests with Fiddler and finally mocking my DataProvider (_data) to retrieve locally, from disk - it turns out that I had responses that were taking 30s+ to come (or even not coming at all).
Since my .Select() is async it always dispalyed info for the quick responses first, and then came to a halt as it was waiting for the slow ones. This gave an illusion that I was somehow loading the first X amount of requests quickly and then stopping. When, in reality, I was simply shown the fastest X amount of requests and then coming to a halt as I was waiting for the slow ones.
And to kind of answer my questions...
What can I do to resolve the issue - set a timeout that allows a maximum number of milliseconds/seconds for a request to finish.
The GetAsync() method is alright.
Async/await operations are also correct, just need to have in mind that doign an async select will return results ordered by the time it took for them to finish.
The ASP.NET Framework controllers are perfectly fine and do not contribute to the issue.

Related

How to simulate long running process in .NET API Controller Correctly

I am trying to create a unit test to simulate my API being called by many people at the same time.
I've got this code in my unit test:
var tasks = new List<Task>();
for (int i = 0; i < 10; i++)
{
var id = i; // must assign to new variable inside for loop
var t = Task.Run(async () =>
{
response = await Client.GetAsync("/api/test2?id=" + id);
Assert.AreEqual(HttpStatusCode.OK, response.StatusCode);
});
tasks.Add(t);
}
await Task.WhenAll(tasks);
Then in my controller I am putting in a Thread.Sleep.
But when I do this, the total time for all tests to complete is 10 x the sleep time.
I expected all the calls to be made and to have ended up at the Thread.Sleep call at more or less the same time.
But it seems the API calls are actually made one after the other.
The reason I am testing the parallel API call is because I want to test a deadlock issue with my data repository when using SQLite which has only happened when more than 1 user uses my website at the same time.
And I have never been able to simulate this and I thought I'd create a unit test, but the code I have now seems to not be executing the calls in parallel.
My plan with the Thread.Sleep calls was to put a couple in the Controller method to make sure all requests end up between certain code blocks at the same time.
Do I need to set a a max number of parallel requests on the Web Server or something or am I doing something obviously wrong?
Thanks in advance.
Update 1:
I forgot to mention I get the same results with await Task.Delay(1000); and many similar alternatives.
Not sure if it's clear but this is all running within a unit test using NUnit.
And the "Web Server" and Client is created like this:
var builder = new WebHostBuilder().UseStartup<TStartup>();
Server = new TestServer(builder);
Client = Server.CreateClient();

You can use Task.Delay(time in milliseconds). Thread.Sleep will not release a thread and it can't process other tasks while waiting to result.

The HttpClient class in .NET has a limit of two concurrent requests to the same server by default, which I believe might be causing the issue in this case. Usually, this limit can be overridden by creating a new HttpClientHandler and using it as an argument in the constructor:
new HttpClient(new HttpClientHandler
{
MaxConnectionsPerServer = 100
})
But because the clients are created using the TestServer method, that gets a little more complicated. You could try changing the ServicePointManager.DefaultConnectionLimit property like below, but I'm not sure if that will work with the TestServer:
System.Net.ServicePointManager.DefaultConnectionLimit = 100;
That being said, I believe using Unit Tests for doing load testing is not a good approach and recommend looking into tools specific for load testing.
Reference for ServicePointManagerClass
This blog post also has more in-depth information about the subject

I found the problem with my test.
It was not the TestServer or the client code, it was the Database code.
In my controller I was starting an NHibernate Transaction, and that was blocking the requests because it would put a lock on the table being updated.
This is correct, so I had to change my code a bit to not automatically start a transaction. But rather leave that up to the calling code to manage.

HttpClient seems to be causing my app to slowdown every ~3 minutes while freeing up a ton of memory

So basically I am running a program which is able to send up to 7,000 HTTP requests every second in average, 24/7, in order to detect last changes on a website as quickly as possible.
However, every 2.5 to 3 minutes in average, my program slowdowns for around 10-15 seconds and goes from ~7K rq/s to less than 1000.
Here are logs from my program, where you can see the amount of requests it sends every second:
https://pastebin.com/029VLxZG
When scrolling down through the logs, you can see it goes slower every ~3 minutes. Example: https://i.imgur.com/US0wPzm.jpeg
At first I thought it was my server's ethernet connection going in a temporary "restricted" mode, and I even tried contacting my host about it. But then I ran 2 instances of my program simulteanously just to see what would happen and I noticed that, even though the issue (downtime) was happening on both, it wasn't always happening at the same time (depending on when the program was started, if you get what I mean), which meant the problem wasn't coming from the internet connection, but my program itself.
I investigated a little bit more, and found out that as soon as my program goes from ~7K rq/s to ~700, a lot of RAM is being freed up on my server.
I have taken 2 screenshots of the consecutive seconds before and once the downtime occurs (including RAM metrics), to compare, and you can view them here: https://imgur.com/a/sk2TYQZ (please note that I was using less threads here, which is why the average "normal" speed is ~2K rq/s instead of ~7K as mentioned before)
If you'd like to see more of it, here is the full record of the issue, in a video which lasts about 40 seconds: https://i.imgur.com/z27FlVP.mp4 - As you can see, after the RAM is freed up, its usage slowly goes up again, before the same process repeats every ~3 minutes.
For more context, here is the method I am using to send the HTTP requests (it is being called from a lot of threads concurrently, as my app is multi-threaded in order to be super fast):
public static async Task<bool> HasChangedAsync(string endpoint, HttpClient httpClient)
{
const string baseAddress = "https://example.com/";
string response = await httpClient.GetStringAsync(baseAddress + endpoint);
return response.Contains("example");
}
One thing I did is I tried replacing the whole method by await Task.Delay(25) then return false, and that fixed the issue, RAM usage was barely increasing.
This lead me to believe the issue is HttpClient / my HTTP requests, and even though I tried replacing the GetStringAsync method by GetAsync using both a HttpRequestMessage and HttpResponseMessage (and disposing them with using), the behavior ended up being the exact same.
So here I am, desperate for a fix, and without enough knowledge about memory, garbage collector etc (if that's even needed here) to be able to fix this myself.
Please, Stack Overflow, do you have any idea?
Thanks a lot.

Your best bet would be to stream the response and then use chunks of it to find what your are looking for. An example implementation could be something as follows:
using var response = await Client.GetAsync(BaseUrl, HttpCompletionOption.ResponseHeadersRead);
await using var stream = await response.Content.ReadAsStreamAsync();
using var reader = new StreamReader(stream);
string line = null;
while ((line = await reader.ReadLineAsync()) != null)
{
if(line.Contains("example"))// do whatever
}

How to crawl XML(s) very fast — considering the below networking limitations?

I have a .Net crawler that's running when the user makes a request (so, it needs to be fast). It crawls 400+ links in real time. (This is the business ask.)
The problem: I need to detect if a link is xml (think of rss or atom feeds) or html. If the link is xml then I continue with processing, but if the link is html I can skip it. Usually, I have 2 xml(s) and 398+ html(s). Currently, I have multiple threads going but the processing is still slow, usually 75 seconds running with 10 threads for 400+ links, or 280 seconds running with 1 thread. (I want to add more threads but see below..)
The challenge that I am facing is that I read the streams as follows:
var request = WebRequest.Create(requestUriString: uri.AbsoluteUri);
// ....
var response = await request.GetResponseAsync();
//....
using (var reader = new StreamReader(stream: response.GetResponseStream(), encoding: encoding)) {
char[] buffer = new char[1024];
await reader.ReadAsync(buffer: buffer, index: 0, count: 1024);
responseText = new string(value: buffer);
}
// parse first byts of reasponseText to check if xml
The problem is that my optimization to get only 1024 is quite useless because the GetResponseAsync is downloading the entire stream anyway, as I see.
(The other option that I have is to look for the header ContentType, but that's quite similar AFAIK because I get the content anyway - in case that you don't recommend to use OPTIONS, that I did not use so far - and in addition xml might be content-type incorrectly marked (?) and I am going to miss some content.)
If there is any optimization that I am missing please help, as I am running out of ideas.
(I do consider to optimize this design by spreading the load on multiple servers, so that I balance the network with the parallelism, but that's a bit of change from the current architecture, that I cannot afford to do at this point in time.)

Using HEAD requests could speed up the requests significantly, IF you can rely on the Content-Type.
e.g
HttpClient client = new HttpClient();
HttpResponseMessage response = await client.SendAsync(new HttpRequestMessage() { Method = HttpMethod.Head});
Just showing basic usage. Obviously you need to add uri and anything else required to the request.
Also just to note that even with 10 threads, 400 request will likely always take quite a while. 400/10 means 40 requests sequentially. Unless the requests are to servers close by then 200ms would be a good response time meaning a minimum of 8 seconds. Ovserseas serves that may be slow could easily push this out to 30-40 seconds of unavoidable delay, unless you increase the amount of threads to parallel more of the requests.
Dataflow (Task Parallel Library) Can be very helpful for writing parallel pipes with a convenient MaxDegreeOfParallelism property for easily adjusting how many parallel instances can be run.

That async-ing feeling - httpclient and mvc thread blocking

Dilemma, dilemma...
I've been working up a solution to a problem that uses async calls to the HttpClient library (GetAsync=>ConfigureAwait(false) etc). IIn a console app, my dll is very responsive and the mixture of using the async await calls and the Parallel.ForEach(=>) really makes me glow.
Now for the issue. After moving from this test harness to the target app, things have become problematic. I'm using asp.net mvc 4 and have hit a few issues. The main issue really is that calling my process on a controller action actually blocks the main thread until the async actions are complete. I've tried using an async controller pattern, I've tried using Task.Factory, I've tried using new Threads. You name it, I've tried all the flavours - and then some!.
Now, I appreciate that the nature of http is not designed to facilitate long processes like this and there are a number of articles here on SO that say don't do it. However, there are mitigating reasons why i NEED to use this approach. The main reason that I need to run this in mvc is due to the fact that I actually update the live data cache (on the mvc app) in realtime via raising an event in my dll's code. This means that fragments of the 50-60 data feeds can be pushed out live before the entire async action is complete. Therefore, client apps can receive partial updates within seconds of the async action being instigated. If I were to delegate the process out to a console app that ran the entire process in the background, I'd no longer be able to harness those fragment partial updates and this is the raison d'etre behind the entire choice of this architecture.
Can anyone shed light on a solution that would allow me to mitigate the blocking of the thread, whilst at the same time, allow each async fragment to be consumed by my object model and fed out to the client apps (I'm using signalr to make these client updates). A kind of nirvanna would be a scenario where an out-of-process cache object could be shared between numerous processes - the cache update could then be triggered and consumed by my mvc process (aka - http://devproconnections.com/aspnet-mvc/out-process-caching-aspnet). And so back to reality...
I have also considered using a secondary webservice to achieve this, but would welcome other options before once again over engineering my solution (there are already many moving parts and a multitude of async Actions going on).
Sorry not to have added any code, I'm hoping for practical philosophy/insights, rather than code help on this, tho would of course welcome coded examples that illustrate a solution to my problem.
I'll update the question as we move in time, as my thinking process is still maturing on this.
[edit] - for the sake of clarity, the snippet below is my brothers grimm code collision (extracted from a larger body of work):
Parallel.ForEach(scrapeDataBases, new ParallelOptions()
{
MaxDegreeOfParallelism = Environment.ProcessorCount * 15
},
async dataBase =>
{
await dataBase.ScrapeUrlAsync().ConfigureAwait(false);
await UpdateData(dataType, (DataCheckerScrape)dataBase);
});

async and Parallel.ForEach do not mix naturally, so I'm not sure what your console solution looks like. Furthermore, Parallel should almost never be used on ASP.NET at all.
It sounds like what you would want is to just use Task.WhenAll.
On a side note, I think your reasoning around background processing on ASP.NET is incorrect. It is perfectly possible to have a separate process that updates the clients via SignalR.

Being that your question is pretty high level without a lot of code. You could try Reactive Extensions.
Something like
private IEnumerable<Task<Scraper>> ScrappedUrls()
{
// Return the 50 to 60 task for each website here.
// I assume they all return the same type.
// return .ScrapeUrlAsync().ConfigureAwait(false);
throw new NotImplementedException();
}
public async Task<IEnumerable<ScrapeOdds>> GetOdds()
{
var results = new Collection<ScrapeOdds>();
var urlRequest = ScrappedUrls();
var observerableUrls = urlRequest.Select(u => u.ToObservable()).Merge();
var publisher = observerableUrls.Publish();
var hubContext = GlobalHost.ConnectionManager.GetHubContext<OddsHub>();
publisher.Subscribe(scraper =>
{
// Whatever you do do convert to the result set
var scrapedOdds = scraper.GetOdds();
results.Add(scrapedOdds);
// update anything else you want when it arrives.
// Update SingalR here
hubContext.Clients.All.UpdatedOdds(scrapedOdds);
});
// Will fire off subscriptions and not continue until they are done.
await publisher;
return results;
}
The merge option will process the results as they come in. You can then update the signalR hubs plus whatever else you need to update as they come in. The controller action will have to wait for them all to come in. That's why there is an await on the publisher.
I don't really know if httpClient is going to like to have 50 - 60 web calls all at once or not. If it doesn't you can just take the IEnumerable to an array and break it down into a smaller chunks. And also there should be some error checking in there. With Rx you can also tell it to SubscribeOn and ObserverOn different threads but I think with everything being pretty much async that wouldn't be necessary.

Web response in C# .NET doesn't work more than a couple of times

I am developing an application using twitter api and that involves writing a method to check if a user exists. Here is my code:
public static bool checkUserExists(string user)
{
//string URL = "https://twitter.com/" + user.Trim();
//string URL = "http://api.twitter.com/1/users/show.xml?screen_name=" + user.Trim();
//string URL = "http://google.com/#hl=en&sclient=psy-ab&q=" + user.Trim();
string URL = "http://api.twitter.com/1/users/show.json?screen_name=" + user.Trim();
HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(URL);
try
{
var webResponse = (HttpWebResponse)webRequest.GetResponse();
return true;
}
//this part onwards does not matter
catch (WebException ex)
{
if (ex.Status == WebExceptionStatus.ProtocolError && ex.Response != null)
{
var resp = (HttpWebResponse)ex.Response;
if (resp.StatusCode == HttpStatusCode.NotFound)
{
return false;
}
else
{
throw new Exception("Unknown level 1 Exception", ex);
}
}
else
{
throw new Exception("Unknown level 2 Exception", ex);
}
}
}
The problem is, calling the method does not work(it doesn't get a response) more than 2 or 3 times, using any of the urls that have been commented, including the google search query(I thought it might be due to twitter API limit). On debug, it shows that it's stuck at:
var webResponse = (HttpWebResponse)webRequest.GetResponse();
Here's how I am calling it:
Console.WriteLine(TwitterFollowers.checkUserExists("handle1"));
Console.WriteLine(TwitterFollowers.checkUserExists("handle2"));
Console.WriteLine(TwitterFollowers.checkUserExists("handle3"));
Console.WriteLine(TwitterFollowers.checkUserExists("handle4"));
Console.WriteLine(TwitterFollowers.checkUserExists("handle5"));
Console.WriteLine(TwitterFollowers.checkUserExists("handle6"));
At most I get 2-3 lines of output. Could someone please point out what's wrong?
Update 1:
I sent 1 request every 15 seconds (well within limit) and it still causes an error. on the other hand, sending a request, closing the app and running it again works very well (on average accounts to 1 request every 5 seconds). The rate limit is 150 calls per hour Twitter FAQ.
Also, I did wait for a while, and got this exception at level 2:
http://pastie.org/3897499
Update 2:
Might sound surprising but if I run fiddler, it works perfectly. Regardless of whether I target this process or not!

The effect you're seeing is almost certainly due to rate-limit type policies on the Twitter API (multiple requests in quick succession). They keep a tight watch on how you're using their API: the first step is to check their terms of use and policies on rate limiting, and make sure you're in compliance.
Two things jump out at me:
You're hitting the API with multiple requests in rapid succession. Most REST APIs, including Google search, are not going to allow you to do that. These APIs are very visible targets, and it makes sense that they'd be pro-active about preventing denial-of-service attacks.
You don't have a User Agent specified in your request. Most APIs require you to send them a meaningful UA, as a way of helping them identify you.

Note that you're dealing with unmanaged resources underneath your HttpWebResponse. So calling Dispose() in a timely fashion or
wrapping the object in a using statement is not only wise, but important to avoid blocking.
Also, var is great for dealing with anonymous types, Linq query
results, and such but it should not become a crutch. Why use var
when you're well aware of the type? (i.e. you're already performing
a cast to HttpWebResponse.)
Finally, services like this often limit the rate of connections per second and/or the number of simultaneous connections allowed to prevent abuse. By not disposing of your HttpWebResponse objects, you may be violating the permitted number of simultaneous connections. By querying too often you'd break the rate limit.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.