Images take a very long time to load in C# - c#

The problem I have is this:
I tried to download 1000+ images. It works, but each image takes a very long time to finish downloading, and meanwhile the program moves on and starts downloading the next images. It can get to, say, the 100th image while the 8th still has not finished downloading.
So I would like to understand why I am running into this problem and/or how to fix it.
I hope to find a solution.
private string DownloadSourceCode(string url)
{
string sourceCode = "";
try
{
using (WebClient WC = new WebClient())
{
WC.Encoding = Encoding.UTF8;
WC.Headers.Add("Accept", "image/webp, */*");
WC.Headers.Add("Accept-Language", "fr, fr-FR");
WC.Headers.Add("Cache-Control", "max-age=1");
WC.Headers.Add("DNT", "1");
WC.Headers.Add("Origin", url);
WC.Headers.Add("TE", "Trailers");
WC.Headers.Add("user-agent", Fichier.LoadUserAgent());
sourceCode = WC.DownloadString(url);
}
}
catch (WebException e)
{
if (e.Status == WebExceptionStatus.ProtocolError)
{
string status = string.Format("{0}", ((HttpWebResponse)e.Response).StatusCode);
LabelID.TextInvoke(string.Format("{0} {1} {2} ", status,
((HttpWebResponse)e.Response).StatusDescription,
((HttpWebResponse)e.Response).Server));
}
}
catch (NotSupportedException a)
{
MessageBox.Show(a.Message);
}
return sourceCode;
}
private void DownloadImage(string URL, string filePath)
{
try
{
using (WebClient WC = new WebClient())
{
WC.Encoding = Encoding.UTF8;
WC.Headers.Add("Accept", "image/webp, */*");
WC.Headers.Add("Accept-Language", "fr, fr-FR");
WC.Headers.Add("Cache-Control", "max-age=1");
WC.Headers.Add("DNT", "1");
WC.Headers.Add("Origin", "https://myprivatesite.fr//" + STARTNBR.ToString());
WC.Headers.Add("user-agent", Fichier.LoadUserAgent());
WC.DownloadFile(URL, filePath);
NBRIMAGESDWLD++;
}
STARTNBR = CheckBoxBack.Checked ? --STARTNBR : ++STARTNBR;
}
catch (IOException)
{
LabelID.TextInvoke("Accès non autorisé au fichier");
}
catch (WebException e)
{
if (e.Status == WebExceptionStatus.ProtocolError)
{
LabelID.TextInvoke(string.Format("{0} / {1} / {2} ", ((HttpWebResponse)e.Response).StatusCode,
((HttpWebResponse)e.Response).StatusDescription,
((HttpWebResponse)e.Response).Server));
}
}
catch (NotSupportedException a)
{
MessageBox.Show(a.Message);
}
}
private void DownloadImages()
{
const string URL = "https://myprivatesite.fr/";
string imageIDURL = string.Concat(URL, STARTNBR);
string sourceCode = DownloadSourceCode(imageIDURL);
if (sourceCode != string.Empty)
{
string imageNameURL = Fichier.GetURLImage(sourceCode);
if (imageNameURL != string.Empty)
{
string imagePath = PATHIMAGES + STARTNBR + ".png";
LabelID.TextInvoke(STARTNBR.ToString());
LabelImageURL.TextInvoke(imageNameURL + "\r");
DownloadImage(imageNameURL, imagePath);
Extension.SaveOptions(STARTNBR, CheckBoxBack.Checked);
}
}
STARTNBR = CheckBoxBack.Checked ? --STARTNBR : ++STARTNBR;
}
// END FUNCTIONS
private void BoutonStartPause_Click(object sender, EventArgs e)
{
if (Fichier.RGBIMAGES != null)
{
if (boutonStartPause.Text == "Start")
{
boutonStartPause.ForeColor = Color.DarkRed;
boutonStartPause.Text = "Pause";
if (myTimer == null)
myTimer = new System.Threading.Timer(_ => new Task(DownloadImages).Start(), null, 0, Trackbar.Value);
}
else if (boutonStartPause.Text == "Pause")
EndTimer();
Extension.SaveOptions(STARTNBR, CheckBoxBack.Checked);
}
}

So I would like to understand why I encounter such a problem here and / or how to fix this problem.
There are probably two reasons I can think of.
Connection/Port Exhaustion
Thread Pool Exhaustion
Connection/Port Exhaustion
This happens when you're attempting to create too many connections at once, or when the connections you made previously have not yet been released. When you use a WebClient the resources it uses sometimes don't get released immediately. This causes a delay between when that object is disposed and the actual time that the next WebClient attempting to use the same port/connection actually gets access to that port.
An example of something that would most likely cause Connection/Port Exhaustion
int i = 1_000;
while(i --> 0)
{
using var Client = new WebClient();
// do some webclient stuff
}
Creating a lot of web clients is sometimes necessary due to the inherent lack of concurrency in WebClient. However, by the time the next WebClient is instantiated, the port the last one was using may not be available yet. That causes either a delay (while it waits for the port) or, worse, the next WebClient opening another port/connection. This can cause a never-ending list of connections to open, causing things to grind to a halt!
Thread Pool Exhaustion
This is caused by trying to create too many Task or Thread objects at once that block their own execution (via Thread.Sleep or a long-running operation).
Normally this isn't an issue, since the built-in TaskScheduler does a really good job of keeping track of a lot of tasks and makes sure that they all get turns to execute their code.
Where this becomes a problem is that the TaskScheduler has no context for which tasks are important, or which tasks are going to need more time than others to complete. Therefore, when many tasks are processing long-running operations, blocking, or throwing exceptions, the TaskScheduler has to wait for those tasks to finish before it can start new ones. If you are particularly unlucky, the TaskScheduler can start a bunch of tasks that are all blocking, and then no other tasks can start, even if all the waiting tasks are small and would complete instantly.
You should generally use as few tasks as possible to increase reliability and avoid thread pool exhaustion.
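To make the "as few tasks as possible" advice concrete, here is a minimal sketch that is not taken from the original code: a single async loop that finishes each step before starting the next, instead of a timer that starts a new task on every tick whether or not the previous one is done. `RunSequentiallyAsync` and its `step` delegate are hypothetical names; in the question's program a step would be one download.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: runs `step` for indexes 0..count-1, one at a time,
// pausing between steps. Returns how many steps completed.
async Task<int> RunSequentiallyAsync(
    int count, TimeSpan pause, Func<int, Task> step, CancellationToken token)
{
    int completed = 0;
    try
    {
        for (int i = 0; i < count; i++)
        {
            token.ThrowIfCancellationRequested();
            await step(i);                  // the next step cannot start before this one ends
            completed++;
            await Task.Delay(pause, token); // plays the role of the timer interval
        }
    }
    catch (OperationCanceledException)
    {
        // Cancellation (for example from a Pause button) simply ends the loop.
    }
    return completed;
}
```

With this shape there is at most one download in flight, so the 8th image cannot still be loading while the 100th starts. Whether one-at-a-time is fast enough is a separate question.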
What you can do
You have a few options to help improve the reliability and performance of this code.
Consider using HttpClient instead. I understand you may be required to use WebClient so I have provided answers using WebClient exclusively.
Consider requesting multiple downloads/strings within the same task to avoid thread pool exhaustion.
Consider using a WebClient helper class that limits the number of WebClient instances that can be active at once, and that can keep instances open if you're going to be accessing the same website multiple times.
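For the first option (HttpClient), a rough sketch could look like the following. It is an illustration under assumptions, not a tested drop-in: `RunBoundedAsync` and `DownloadAllAsync` are hypothetical names, and the limit passed as `maxConcurrency` is arbitrary. One shared HttpClient is reused for every request, and a SemaphoreSlim caps how many requests are in flight.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// One shared HttpClient for the whole application, so connections are reused.
var http = new HttpClient();

// Hypothetical helper: runs `work` for every item, but never more than
// `maxConcurrency` at the same time. Results keep the order of `items`.
async Task<TResult[]> RunBoundedAsync<TItem, TResult>(
    IEnumerable<TItem> items, int maxConcurrency, Func<TItem, Task<TResult>> work)
{
    using var gate = new SemaphoreSlim(maxConcurrency);
    var tasks = items.Select(async item =>
    {
        await gate.WaitAsync();          // wait for a free slot
        try { return await work(item); }
        finally { gate.Release(); }      // free the slot even if the work failed
    }).ToList();
    return await Task.WhenAll(tasks);
}

// Hypothetical wrapper: downloads every URL as bytes with a bounded number of requests.
Task<byte[][]> DownloadAllAsync(IEnumerable<string> urls, int maxConcurrency) =>
    RunBoundedAsync(urls, maxConcurrency, url => http.GetByteArrayAsync(url));
```

A call such as `await DownloadAllAsync(urls, 4)` would then keep at most four downloads running; saving the bytes to disk and setting headers are left out of the sketch.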
WebClient Helper Class
I created a very simple helper class to get you started. This will allow you to create WebClient requests asynchronously without having to worry about creating too many clients at once. The default limit is the number of cores in the client's processor (this was chosen arbitrarily).
public class ConcurrentWebClient
{
// limits the number of maximum clients able to be opened at once
public static int MaxConcurrentDownloads => Environment.ProcessorCount;
// holds any clients that should be kept open
private static readonly ConcurrentDictionary<string, WebClient> Clients;
// prevents more than the allotted number of webclients from being open at once
public static readonly SemaphoreSlim Locker;
// allows cancellation of clients
private static CancellationTokenSource TokenSource = new();
static ConcurrentWebClient()
{
Clients = new ConcurrentDictionary<string, WebClient>();
Locker ??= new SemaphoreSlim(MaxConcurrentDownloads, MaxConcurrentDownloads);
}
// creates new clients, or if a name is provided retrieves it from the dictionary so we don't need to create more than we need
private async Task<WebClient> CreateClient(string Name, bool persistent, CancellationToken token)
{
// try to retrieve it from the dictionary before creating a new one
if (Clients.ContainsKey(Name))
{
return Clients[Name];
}
WebClient newClient = new();
if (persistent)
{
// try to add the client to the dict so we can reference it later
while (Clients.TryAdd(Name, newClient) is false)
{
token.ThrowIfCancellationRequested();
// allow other tasks to do work while we wait to add the new client
await Task.Delay(1, token);
}
}
return newClient;
}
// allows sending basic dynamic requests without having to create webclients outside of this class
public async Task<T> NewRequest<T>(Func<WebClient, T> Expression, int? MaxTimeout = null, string Id = null)
{
// make sure we dont have more than the maximum clients open at one time
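// 100s was chosen because WebClient has a default timeout of 100s
// WaitAsync returns false on timeout; bail out so we never release a slot we did not acquire
if (!await Locker.WaitAsync(MaxTimeout ?? 100_000, TokenSource.Token))
{
throw new TimeoutException("Timed out waiting for a free WebClient slot.");
}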
bool persistent = true;
if (Id is null)
{
persistent = false;
Id = string.Empty;
}
try
{
WebClient client = await CreateClient(Id, persistent, TokenSource.Token);
// run the expression to get the result
T result = await Task.Run<T>(() => Expression(client), TokenSource.Token);
if (persistent is false)
{
// just in case the user disposes of the client or sets it to null in the expression, we should not assume it's non-null at this point
client?.Dispose();
}
return result;
}
finally
{
// make sure even if we encounter an error we still
// release the lock
Locker.Release();
}
}
// allows assigning the headers without having to do it for every webclient manually
public static void AssignDefaultHeaders(WebClient client)
{
client.Encoding = System.Text.Encoding.UTF8;
client.Headers.Add("Accept", "image/webp, */*");
client.Headers.Add("Accept-Language", "fr, fr-FR");
client.Headers.Add("Cache-Control", "max-age=1");
client.Headers.Add("DNT", "1");
// i have no clue what Fichier is so this was not tested
client.Headers.Add("user-agent", Fichier.LoadUserAgent());
}
// cancels a webclient by name, whether its being used or not
public async Task Cancel(string Name)
{
// look to see if we can find the client
if (Clients.ContainsKey(Name))
{
// get a token in case we have to do an emergency cancel
CancellationToken token = TokenSource.Token;
// try to get the client from the dictionary
WebClient foundClient = null;
while (Clients.TryGetValue(Name, out foundClient) is false)
{
token.ThrowIfCancellationRequested();
// allow other tasks to perform work while we wait to get the value from the dictionary
await Task.Delay(1, token);
}
// if we found the client we should cancel and dispose of it so its resources get freed
if (foundClient != null)
{
foundClient?.CancelAsync();
foundClient?.Dispose();
}
}
}
// the emergency stop button
public void ForceCancelAll()
{
// this will throw lots of OperationCanceledException, be prepared to catch them, they're fast.
TokenSource?.Cancel();
TokenSource?.Dispose();
TokenSource = new();
foreach (var item in Clients)
{
item.Value?.CancelAsync();
item.Value?.Dispose();
}
Clients.Clear();
}
}
Request Multiple Things at Once
Here all I did was switch to using the helper class, and made it so you can request multiple things using the same connection
public async Task<string[]> DownloadSourceCode(string[] urls)
{
var downloader = new ConcurrentWebClient();
return await downloader.NewRequest<string[]>((WebClient client) =>
{
ConcurrentWebClient.AssignDefaultHeaders(client);
client.Headers.Add("TE", "Trailers");
string[] result = new string[urls.Length];
for (int i = 0; i < urls.Length; i++)
{
string url = urls[i];
client.Headers.Remove("Origin");
client.Headers.Add("Origin", url);
result[i] = client.DownloadString(url);
}
return result;
});
}
private async Task<bool> DownloadImage(string[] URLs, string[] filePaths)
{
var downloader = new ConcurrentWebClient();
bool downloadsSucessful = await downloader.NewRequest<bool>((WebClient client) =>
{
ConcurrentWebClient.AssignDefaultHeaders(client);
int len = Math.Min(URLs.Length, filePaths.Length);
for (int i = 0; i < len; i++)
{
// side-note, this is assuming the websites you're visiting aren't mutating the headers
client.Headers.Remove("Origin");
client.Headers.Add("Origin", "https://myprivatesite.fr//" + STARTNBR.ToString());
client.DownloadFile(URLs[i], filePaths[i]);
NBRIMAGESDWLD++;
STARTNBR = CheckBoxBack.Checked ? --STARTNBR : ++STARTNBR;
}
return true;
});
return downloadsSucessful;
}

Related

Unity C# - Execute async task within a Coroutine and wait for it to finish

I'm completely new to Unity development and I'm trying to integrate an asynchronous functionality into an existing Coroutine but I have faced several issues so far.
My issues:
The Unity app completely freezes, probably because it was blocked by a thread. I implemented this code in traditional C# (console app) without having any issues. (Fixed now in Unity after some modifications.)
The download task begins but it never finishes. This happens only when I run it as an APK. Debugging in Unity on PC works fine.
My code:
public void Validate()
{
StartCoroutine(DoWork());
}
private IEnumerator DoWork()
{
bool success;
//Perform some calculations here
..
..
success = true;
//
yield return new WaitForSeconds(3);
if(success){
GetInfo(_files);
}
}
async void GetInfo(List<string> files)
{
await StartDownload(files);
//After completing the download operation, perform some other actions in the background
...
...
//
//When done, change the active status of specific game objects
}
public async Task StartDownload(List<string> files){
var t1 = GetFileSizesAsync(files);
var t2 = DownloadFilesAsync(files);
await Task.WhenAll(t1, t2);
}
public async Task GetFileSizesAsync(List<string> urls)
{
foreach (var url in urls)
GetFileSize(url);
txtSize.text = totalSizeMb +"MB";
}
private void GetFileSize(string url)
{
var uri = new Uri(url);
var webRequest = HttpWebRequest.Create(uri);
webRequest.Method = "GET";
try
{
var webResponse = webRequest.GetResponse();
var fileSize = webResponse.Headers.Get("Content-Length");
var fileSizeInMegaByte = Math.Round(Convert.ToDouble(fileSize) / 1024.0 / 1024.0, 2);
totalSizeMb = totalSizeMb + fileSizeInMegaByte;
}
catch (Exception ex)
{
}
finally
{
webRequest.Abort();
}
}
public async Task<List<string>> DownloadFilesAsync(List<string> urls)
{
var result = new List<string>();
foreach (var url in urls)
{
var download = await DownloadFile(url);
if(download)
result.Add(url);
}
return result;
}
private async Task<bool> DownloadFile(string url)
{
WebClient webClient = new WebClient();
var uri = new Uri(url);
var saveHere = "C:\\...";
try
{
await webClient.DownloadFileTaskAsync(uri, saveHere);
return true;
}
catch (Exception ex)
{
return false;
}
}
Can you tell me what I'm doing wrong here? I've tried several ways but couldn't manage to find a proper solution.
Thank you!
I would rather make it
async Task GetInfo(List<string> files){ ... }
And then in your Coroutine do
var t = Task.Run(async () => await GetInfo(files));
while(!t.IsCompleted)
{
yield return null;
}
// Or also
//yield return new WaitUntil(() => t.IsCompleted);
if(!t.IsCompletedSuccessfully)
{
Debug.LogError("Task failed or canceled!");
yield break;
}
Note however:
When done, change the active status of specific game objects
This can't be done async! It has to happen on the Unity main thread! Therefore you would probably rather return something from your GetInfo task and activate the objects in the Coroutine when done.
After the loop has finished yielding, you can then access the return value via
var result = t.Result;
Your web requests are currently totally redundant! You start get requests only to check how big the received content is but immediately throw away that received content ... then you start a second request to actually download the files (again).
In general I would recommend using UnityWebRequest.Get instead, which you can yield directly in the Coroutine, in combination with a DownloadHandlerFile, which lets you download the content straight into a local file instead of into runtime memory.
Also
var saveHere = "C:\\...";
is hopefully not what you are trying to use as path on Android ;)
First of all.
The freezing is definitely being caused by the IEnumerator, as I don't see a yield statement. What's a yield statement? Uhhhh... Google it. lel
Second of all.
Wouldn't a normal void work just fine?
Third of all.
I don't know much about async's since I'm pretty new to them
but I'm fairly certain you don't need them for:
Task GetFileSizesAsync(List<string> urls)
AND
async Task<List<string>> DownloadFilesAsync(List<string> urls)
I may be wrong tho.

Issue With HttpClient Bulk Parallel Request in .Net Core C#

So I have been struggling with this issue for like 3 weeks. Here's what I want to do.
So I have like 2000 stock options. I want to fetch 5 of them at a time and process but it all has to be parallel. I'll write them in steps to make it more clear.
Get 5 stock symbols from an array.
Send them off to fetch their data and process it. Don't wait for a response; keep on processing.
Wait 2.6 seconds (we are limited to 120 API requests per minute, so this delay keeps us throttled to about 115 per minute).
Go to step 1.
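The four steps above can be sketched roughly as follows. This is only an illustration under assumptions: `FetchInBatchesAsync` and its `fetch` delegate are hypothetical stand-ins for the real API call, and error handling and token refresh are left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical helper: starts `fetch` for `batchSize` symbols at a time,
// pauses between batches without waiting for responses, and only awaits
// everything at the end. Results come back in the order of `symbols`.
async Task<List<TResult>> FetchInBatchesAsync<TResult>(
    IReadOnlyList<string> symbols, int batchSize, TimeSpan pause,
    Func<string, Task<TResult>> fetch)
{
    var pending = new List<Task<TResult>>();
    for (int start = 0; start < symbols.Count; start += batchSize)
    {
        int end = Math.Min(start + batchSize, symbols.Count);
        for (int i = start; i < end; i++)
        {
            pending.Add(fetch(symbols[i])); // start the request, do not await it yet
        }
        if (end < symbols.Count)
        {
            await Task.Delay(pause);        // throttle: e.g. 2.6 s between batches of 5
        }
    }
    return (await Task.WhenAll(pending)).ToList();
}
```

Because the results are gathered from `Task.WhenAll` rather than appended to a shared list from inside each task, the sketch avoids several tasks writing to the same collection at once.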
All the steps above have to be parallel. I have written the code for it and it all seems to be working fine but randomly it crashes saying
"A connection attempt failed because the connected party did not
properly respond after a period of time, or established connection
failed because connected host has failed to respond".
And sometimes it'll never happen and everything works like a charm.
This error is very random. It could show up on maybe 57th stock or maybe at 1829th stock. I have used HttpClient for it. I have tested this same scenario using Angular and creating custom requests and it never crashes there so it's not third-party server's fault.
What I have already done:
Changed HttpClient class usage from new instances every time to a single instance for the whole project.
Increased the ServicePointManager connection limit to a different number. (Default for .net core is 2)
Instead of HttpClient Queuing I have used SemaphoreSlim for queue and short-circuiting.
Forced ConnectionLeaseTimeout to 40 seconds to detect DNS changes if any.
Changed Async Tasks to threading.
Tried almost everything from the internet.
My suspicions:
I suspect that it has something to do with the HttpClient class. I have read a lot of bad things about its misleading documentation etc.
My friend's suspicion:
He said it could be because of concurrent tasks and that I should change it to threads.
Here's the code:
// Inside Class Constructor
private readonly HttpClient HttpClient = new HttpClient();
SetMaxConcurrency(ApiBaseUrl, maxConcurrentRequests);
// SetMaxConcurrency function
private void SetMaxConcurrency(string url, int maxConcurrentRequests)
{
ServicePointManager.FindServicePoint(new Uri(url)).ConnectionLimit = maxConcurrentRequests;
ServicePointManager.FindServicePoint(new Uri(url)).ConnectionLeaseTimeout = 40*1000;
}
// code for looping through chunks of symbol each chunk has 5 symbols/stocks in it
foreach(var chunkedSymbol in chunkedSymbols)
{
//getting o auth token
string AuthToken = await OAuth();
if(String.IsNullOrEmpty(AuthToken))
{
throw new ArgumentNullException("Access Token is null!");
}
processingSymbols += chunkSize;
OptionChainReq.symbol = chunkedSymbol.ToArray();
async Task func()
{
//function that makes request
var response = await GetOptionChain(AuthToken, ClientId, OptionChainReq);
// concat the result in main list
appResponses = appResponses.Concat(response).ToList();
}
// if request reaches 115 process the remaning requests first
if(processingSymbols >= 115)
{
await Task.WhenAll(tasks);
processingSymbols = 0;
}
tasks.Add(func());
// 2600 millisecond delay to wait for all the data to process
await Task.Delay(delay);
}
//once the loop is completed process the remaining requests
await Task.WhenAll(tasks);
// This code processes every symbol. this code is inside GetOptionChain()
try{
var tasks = new List<Task>();
foreach (string symbol in OptionChainReq.symbol)
{
List<KeyValuePair<string, string>> Params = new List<KeyValuePair<string, string>>();
string requestParams = string.Empty;
// Converting Request Params to Key Value Pair.
Params.Add(new KeyValuePair<string, string>("apikey" , ClientId));
// URL Request Query parameters.
requestParams = new FormUrlEncodedContent(Params).ReadAsStringAsync().Result;
string endpoint = ApiBaseUrl + "/marketdata/chains?";
HttpClient.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", OAuthToken);
Uri tosUri = new Uri(endpoint + requestParams, UriKind.Absolute);
async Task func()
{
try{
string responseString = await GetTosData(tosUri);
OptionChainResponse OptionChainRes = JsonConvert.DeserializeObject<OptionChainResponse>(responseString);
var mappedOptionAppRes = MapOptionsAppRes( OptionChainRes );
if(mappedOptionAppRes != null)
{
OptionsData.Add( mappedOptionAppRes );
}
}
catch(Exception ex)
{
throw new Exception("Crashed");
}
}
// asyncronusly processing each request
tasks.Add(func());
}
//making sure all 5 requests are processed
await Task.WhenAll(tasks);
}
catch (Exception ex)
{
failedSymbols += " "+ string.Join(",", OptionChainReq.symbol);
}
// The code below is for individual request
public async Task<string> GetTosData(Uri url)
{
try
{
await semaphore.WaitAsync();
if (IsTripped())
{
return UNAVAILABLE;
}
var response = await HttpClient.GetAsync(url);
if(response.StatusCode == System.Net.HttpStatusCode.Unauthorized)
{
string OAuthToken = await OAuth();
HttpClient.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", OAuthToken);
return await GetTosData(url);
}
else if(response.StatusCode != HttpStatusCode.OK)
{
TripCircuit(reason: $"Status not OK. Status={response.StatusCode}");
return UNAVAILABLE;
}
return await response.Content.ReadAsStringAsync();
}
catch(Exception ex) when (ex is OperationCanceledException || ex is TaskCanceledException)
{
Console.WriteLine("Timed out");
TripCircuit(reason: $"Timed out");
return UNAVAILABLE;
}
finally
{
semaphore.Release();
}
}

Detect if webclient's DownloadComplete event handler execution finished?

I am new to the whole async and threading world of programming, and I am stuck on one problem. The following code is a simplified version for better understanding.
I am trying to do three things:
1) Hit an API in a loop using WebClient and its async method, and start downloading the data
2) While the API data is downloading, use that time to process other data and calculate some values
3) Make sure all downloads have completed, then process the downloaded data and save it to a file and the database
I am able to achieve the first two steps, but for the third step I am not sure how to detect whether all downloads have completed. I Googled and found this, but the problem is that it requires .NET 4.5 and I am working on .NET 4.0. So basically I need a solution that will help me detect when all the API calls have finished downloading.
One way is to compare the loop call count with the count of the completed data item list, but if even one or two API calls fail, it will wait indefinitely.
Below is my code:
class Program
{
public static List<StackRoot> AllQuestionRoot = new List<StackRoot>();
static void Main(string[] args)
{
MyWebClient client = new MyWebClient();
try
{
for (int i = 0; i < 10; i++)
{
client.DownloadStringCompleted += new DownloadStringCompletedEventHandler(HandleQuestionDownloadCompleted);
client.DownloadStringAsync(new Uri("http://api.stackexchange.com/2.2/questions?page=1&pagesize=20&order=desc&sort=creation&tagged=reporting-services&site=stackoverflow"), waiter);
}
}
catch (WebException exception)
{
string responseText;
using (var reader = new StreamReader(exception.Response.GetResponseStream()))
{
responseText = reader.ReadToEnd();
}
}
//Do some other stuff
//calculate values
//How to make sure all my asynch DownloadStringCompleted calls are completed ?
//process AllQuestionRoot data depending on some values calculated above
//save the AllQuestionRoot to database and directory
Console.ReadKey();
}
private static void HandleQuestionDownloadCompleted(object sender, DownloadStringCompletedEventArgs e)
{
if (e.Error == null || !e.Cancelled)
{
StackRoot responseRoot = JsonConvert.DeserializeObject<StackRoot>(e.Result);
AllQuestionRoot.Add(responseRoot);
}
}
}
Feel free to comment in case of confusion. If there is any other way to achieve what I am doing, please feel free to mention it; there is no need to follow my approach. Any pointers towards an answer would be great.
As a side note, you can use Microsoft Async to use async-await on .NET 4.0.
So, you need to have some way to wait for the end of a series of "tasks".
Since you seem to know how many "tasks" you have, a CountdownEvent is a good fit:
class Program
{
public static List<StackRoot> AllQuestionRoot = new List<StackRoot>();
public static object criticalSection = new object();
public static CountdownEvent countdown = new CountdownEvent(10);
static void Main(string[] args)
{
try
{
for (int i = 0; i < 10; i++)
{
// A WebClient instance does not support concurrent operations, so each request gets its own client.
MyWebClient client = new MyWebClient();
client.DownloadStringCompleted += new DownloadStringCompletedEventHandler(HandleQuestionDownloadCompleted);
client.DownloadStringAsync(new Uri("http://api.stackexchange.com/2.2/questions?page=1&pagesize=20&order=desc&sort=creation&tagged=reporting-services&site=stackoverflow"), waiter);
}
}
catch (WebException exception)
{
string responseText;
using (var reader = new StreamReader(exception.Response.GetResponseStream()))
{
responseText = reader.ReadToEnd();
}
}
//Do some other stuff
//calculate values
//How to make sure all my asynch DownloadStringCompleted calls are completed ?
//process AllQuestionRoot data depending on some values calculated above
//save the AllQuestionRoot to database and directory
// Wait until all have been completed.
countdown.Wait();
Console.ReadKey();
}
private static void HandleQuestionDownloadCompleted(object sender, DownloadStringCompletedEventArgs e)
{
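try
{
// Only read e.Result when the call neither failed nor was cancelled;
// accessing it otherwise throws.
if (e.Error == null && !e.Cancelled)
{
StackRoot responseRoot = JsonConvert.DeserializeObject<StackRoot>(e.Result);
// Adding to List<T> is not thread safe.
lock (criticalSection)
{
AllQuestionRoot.Add(responseRoot);
}
}
}
finally
{
// Signal completion even on failure, so countdown.Wait() cannot hang.
countdown.Signal();
}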
}
}

Combining a while loop with Task.Run() in C#

I'm pretty new to multithread applications in C# and I'm trying to edit my code below so that it runs on multiple threads. Right now it operates synchronously and it takes up very little cpu power. I need it to run much faster on multiple threads. My thought was starting a task for each core and then when a task finishes, allow another to take its place or something like that if it is possible.
static void Main(string[] args)
{
string connectionString = CloudConfigurationManager.GetSetting("Microsoft.ServiceBus.ConnectionString");
QueueClient Client = QueueClient.CreateFromConnectionString(connectionString, "OoplesQueue");
try
{
while (true)
{
Task.Run(() => processCalculations(Client));
}
}
catch (Exception ex)
{
Console.WriteLine(ex.Message);
Console.WriteLine(ex.StackTrace);
}
}
public static ConnectionMultiplexer connection;
public static IDatabase cache;
public static async Task processCalculations(QueueClient client)
{
try
{
BrokeredMessage message = await client.ReceiveAsync();
if (message != null)
{
if (connection == null || !connection.IsConnected)
{
connection = await ConnectionMultiplexer.ConnectAsync("connection,SyncTimeout=10000,ConnectTimeout=10000");
//connection = ConnectionMultiplexer.Connect("connection,SyncTimeout=10000,ConnectTimeout=10000");
}
cache = connection.GetDatabase();
string sandpKey = message.Properties["sandp"].ToString();
string dateKey = message.Properties["date"].ToString();
string symbolclassKey = message.Properties["symbolclass"].ToString();
string stockdataKey = message.Properties["stockdata"].ToString();
string stockcomparedataKey = message.Properties["stockcomparedata"].ToString();
List<StockData> sandp = cache.Get<List<StockData>>(sandpKey);
DateTime date = cache.Get<DateTime>(dateKey);
SymbolInfo symbolinfo = cache.Get<SymbolInfo>(symbolclassKey);
List<StockData> stockdata = cache.Get<List<StockData>>(stockdataKey);
List<StockMarketCompare> stockcomparedata = cache.Get<List<StockMarketCompare>>(stockcomparedataKey);
StockRating rating = performCalculations(symbolinfo, date, sandp, stockdata, stockcomparedata);
if (rating != null)
{
saveToTable(rating);
if (message.LockedUntilUtc.Minute <= 1)
{
await message.RenewLockAsync();
}
await message.CompleteAsync();
}
else
{
Console.WriteLine("Message " + message.MessageId + " Completed!");
await message.CompleteAsync();
}
}
}
catch (TimeoutException time)
{
Console.WriteLine(time.Message);
}
catch (MessageLockLostException locks)
{
Console.WriteLine(locks.Message);
}
catch (RedisConnectionException redis)
{
Console.WriteLine("Start the redis server service!");
}
catch (MessagingCommunicationException communication)
{
Console.WriteLine(communication.Message);
}
catch (Exception ex)
{
Console.WriteLine(ex.Message);
Console.WriteLine(ex.StackTrace);
}
}
This looks like a classic producer-consumer pattern.
In this case, where you need concurrency combined with async IO-bound operations (such as retrieving data from a Redis cache) and CPU-bound operations (such as doing compute-bound calculations), I'd leverage TPL Dataflow for the job.
You can use an ActionBlock<T>, which is responsible for processing a single action you pass to it. Behind the scenes it takes care of concurrency, and you can limit that concurrency as you want by passing it an ExecutionDataflowBlockOptions.
You start off by creating the ActionBlock<BrokeredMessage>:
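private static void Main(string[] args)
{
// The QueueClient is created exactly as in the question.
string connectionString = CloudConfigurationManager.GetSetting("Microsoft.ServiceBus.ConnectionString");
QueueClient client = QueueClient.CreateFromConnectionString(connectionString, "OoplesQueue");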
var actionBlock = new ActionBlock<BrokeredMessage>(async message =>
await ProcessCalculationsAsync(message),
new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = Environment.ProcessorCount
});
var produceMessagesTask = Task.Run(async () => await
ProduceBrokeredMessagesAsync(client,
actionBlock));
produceMessagesTask.Wait();
}
Now let's look at what ProduceBrokeredMessagesAsync does. It simply receives your QueueClient and the ActionBlock and does the following:
private static async Task ProduceBrokeredMessagesAsync(QueueClient client,
ActionBlock<BrokeredMessage> actionBlock)
{
BrokeredMessage message;
while ((message = await client.ReceiveAsync()) != null)
{
await actionBlock.SendAsync(message);
}
actionBlock.Complete();
await actionBlock.Completion;
}
What this does is, while you receive messages from your QueueClient, asynchronously send each message to the ActionBlock, which will process those messages concurrently.
Right now it operates synchronously and it takes up very little cpu power. I need it to run much faster on multiple threads.
"Multiple threads" doesn't necessarily mean "faster". That is only true if you have multiple calculations to perform that are independent of each other, and they are CPU-bound (meaning they mainly involve CPU operations, not IO operations).
Additionally, async doesn't necessarily mean multiple threads. It just means your operation is not blocking a process thread while in progress. If you're starting another thread and blocking it, then that looks like async but it really isn't. Check out this Channel 9 video: Async Library Methods Shouldn't Lie
Most of your operations in processCalculations look like they are dependent on each other; however, this part might be a potential improvement point:
List<StockData> sandp = cache.Get<List<StockData>>(sandpKey);
DateTime date = cache.Get<DateTime>(dateKey);
SymbolInfo symbolinfo = cache.Get<SymbolInfo>(symbolclassKey);
List<StockData> stockdata = cache.Get<List<StockData>>(stockdataKey);
List<StockMarketCompare> stockcomparedata = cache.Get<List<StockMarketCompare>>(stockcomparedataKey);
StockRating rating = performCalculations(symbolinfo, date, sandp, stockdata, stockcomparedata);
I'm not familiar with the API you're using but IF it includes an async equivalent of the Get method you might be able to do those IO operations asynchronously in parallel, e.g.:
var sandpTask = cache.GetAsync<List<StockData>>(sandpKey);
var dateTask = cache.GetAsync<DateTime>(dateKey);
var symbolinfoTask = cache.GetAsync<SymbolInfo>(symbolclassKey);
var stockdataTask = cache.GetAsync<List<StockData>>(stockdataKey);
var stockcomparedataTask = cache.GetAsync<List<StockMarketCompare>>(stockcomparedataKey);
await Task.WhenAll(sandpTask, dateTask,symbolinfoTask,
stockdataTask, stockcomparedataTask);
List<StockData> sandp = sandpTask.Result;
DateTime date = dateTask.Result;
SymbolInfo symbolinfo = symbolinfoTask.Result;
List<StockData> stockdata = stockdataTask.Result;
List<StockMarketCompare> stockcomparedata = stockcomparedataTask.Result;
StockRating rating = performCalculations(symbolinfo, date, sandp, stockdata, stockcomparedata);
Also, note that you don't need to wrap the processCalculations call in another Task since it already returns a task:
// instead of Task.Run(() => processCalculations(message));
processCalculations(message);
You need two parts:
Part 1 waits for an incoming message (ConnectAsync()) and runs in a simple loop. Whenever something is received, an instance of Part 2 is started to process the incoming message.
Part 2 runs on another thread, in the background, and processes a single incoming message.
That way several instances of Part 2 may run in parallel.
So your structure is like this:
while (true)
{
connection = await ConnectionMultiplexer.ConnectAsync(...);
StartProcessCalculationsInBackground(connection, ...); // return immediately
}

Mass Downloading of Webpages C#

My application requires that I download a large number of web pages into memory for further parsing and processing. What is the fastest way to do it? My current method (shown below) seems to be too slow and occasionally results in timeouts.
for (int i = 1; i<=pages; i++)
{
string page_specific_link = baseurl + "&page=" + i.ToString();
try
{
WebClient client = new WebClient();
var pagesource = client.DownloadString(page_specific_link);
client.Dispose();
sourcelist.Add(pagesource);
}
catch (Exception)
{
}
}
The way you approach this problem is going to depend very much on how many pages you want to download, and how many sites you're referencing.
I'll use a good round number like 1,000. If you want to download that many pages from a single site, it's going to take a lot longer than if you want to download 1,000 pages that are spread out across dozens or hundreds of sites. The reason is that if you hit a single site with a whole bunch of concurrent requests, you'll probably end up getting blocked.
So you have to implement a type of "politeness policy," that issues a delay between multiple requests on a single site. The length of that delay depends on a number of things. If the site's robots.txt file has a crawl-delay entry, you should respect that. If they don't want you accessing more than one page per minute, then that's as fast as you should crawl. If there's no crawl-delay, you should base your delay on how long it takes a site to respond. For example, if you can download a page from the site in 500 milliseconds, you set your delay to X. If it takes a full second, set your delay to 2X. You can probably cap your delay to 60 seconds (unless crawl-delay is longer), and I would recommend that you set a minimum delay of 5 to 10 seconds.
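As a rough illustration of the delay rule described above, here is a minimal sketch. The Politeness class, the ComputeDelay name, and the multiplier parameter are all hypothetical (the answer only says "X" and "2X"), and the 5 second floor and 60 second cap are simply the figures mentioned above. It also assumes that a crawl-delay from robots.txt, when present, is used as-is.

```csharp
using System;

public static class Politeness
{
    // Bounds taken from the figures in the answer above (assumed defaults).
    static readonly TimeSpan MinDelay = TimeSpan.FromSeconds(5);
    static readonly TimeSpan MaxDelay = TimeSpan.FromSeconds(60);

    // responseTime: how long the site took to serve the last page.
    // crawlDelay:   the robots.txt crawl-delay, if the site declares one.
    // multiplier:   hypothetical tuning knob; delay = responseTime * multiplier.
    public static TimeSpan ComputeDelay(
        TimeSpan responseTime,
        TimeSpan? crawlDelay = null,
        double multiplier = 10.0)
    {
        // Assumption: a declared crawl-delay is respected as-is,
        // even when it is longer than the 60 second cap.
        if (crawlDelay.HasValue)
            return crawlDelay.Value;

        // Scale the delay with the observed response time, then clamp it.
        var scaled = TimeSpan.FromMilliseconds(
            responseTime.TotalMilliseconds * multiplier);

        if (scaled < MinDelay) return MinDelay;
        if (scaled > MaxDelay) return MaxDelay;
        return scaled;
    }
}
```

A crawler would call ComputeDelay after each response from a given site and wait that long before issuing the next request to the same site; requests to other sites need not wait.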
I wouldn't recommend using Parallel.ForEach for this. My testing has shown that it doesn't do a good job. Sometimes it over-taxes the connection and often it doesn't allow enough concurrent connections. I would instead create a queue of WebClient instances and then write something like:
// Create queue of WebClient instances
BlockingCollection<WebClient> ClientQueue = new BlockingCollection<WebClient>();
// Initialize queue with some number of WebClient instances
// now process urls
foreach (var url in urls_to_download)
{
var worker = ClientQueue.Take();
worker.DownloadStringAsync(url, ...);
}
When you initialize the WebClient instances that go into the queue, point their DownloadStringCompleted event at a completed event handler. That handler should save the string to a file (or perhaps you should just use DownloadFileAsync), and then add the client back to the ClientQueue.
In my testing, I've been able to support 10 to 15 concurrent connections with this method. Any more than that and I run into problems with DNS resolution (DownloadStringAsync doesn't do the DNS resolution asynchronously). You can get more connections, but doing so is a lot of work.
That's the approach I've taken in the past, and it's worked very well for downloading thousands of pages quickly. It's definitely not the approach I took with my high performance Web crawler, though.
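The queue-of-clients idea above could be fleshed out roughly as follows. This is only a sketch under assumptions: the PooledDownloader class, the DownloadAll method, and the poolSize parameter are hypothetical names, results are kept in memory rather than written to files, and failed downloads are silently skipped. It relies on the DownloadStringAsync(Uri, Object) overload to carry the URL to the completion handler through e.UserState.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Net;
using System.Threading;

public static class PooledDownloader
{
    // Downloads every URL using a fixed pool of reusable WebClient instances.
    public static ConcurrentDictionary<Uri, string> DownloadAll(
        IReadOnlyList<Uri> urls, int poolSize = 10)
    {
        var results = new ConcurrentDictionary<Uri, string>();
        if (urls.Count == 0) return results;

        var pool = new BlockingCollection<WebClient>();
        using (var done = new CountdownEvent(urls.Count))
        {
            for (int i = 0; i < poolSize; i++)
            {
                var client = new WebClient();
                client.DownloadStringCompleted += (sender, e) =>
                {
                    // The URL was passed as the user token when the request started.
                    var url = (Uri)e.UserState;
                    if (e.Error == null && !e.Cancelled)
                        results[url] = e.Result;

                    // Return the client to the pool before signalling completion.
                    pool.Add((WebClient)sender);
                    done.Signal();
                };
                pool.Add(client);
            }

            foreach (var url in urls)
            {
                // Blocks until a client is free, which caps concurrency at poolSize.
                var worker = pool.Take();
                worker.DownloadStringAsync(url, url);
            }

            // Wait until every request has completed, failed, or been cancelled.
            done.Wait();
        }

        // All clients are back in the pool at this point, so they can be disposed.
        foreach (var client in pool) client.Dispose();
        return results;
    }
}
```

Because DownloadAll blocks its caller, this sketch is meant to be called from a console application or a background thread. On a UI thread the completion handlers would be posted back to the blocked thread and never run.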
I should also note that there is a huge difference in resource usage between these two blocks of code:
WebClient MyWebClient = new WebClient();
foreach (var url in urls_to_download)
{
MyWebClient.DownloadString(url);
}
---------------
foreach (var url in urls_to_download)
{
WebClient MyWebClient = new WebClient();
MyWebClient.DownloadString(url);
}
The first allocates a single WebClient instance that is used for all requests. The second allocates one WebClient for each request. The difference is huge. WebClient uses a lot of system resources, and allocating thousands of them in a relatively short time is going to impact performance. Believe me ... I've run into this. You're better off allocating just 10 or 20 WebClients (as many as you need for concurrent processing), rather than allocating one per request.
Why not just use a web crawling framework? It can handle all the stuff for you (multithreading, HTTP requests, parsing links, scheduling, politeness, etc.).
Abot (https://code.google.com/p/abot/) handles all that stuff for you and is written in c#.
In addition to @David's perfectly valid answer, I want to add a slightly cleaner "version" of his approach.
var pages = new List<string> { "http://bing.com", "http://stackoverflow.com" };
var sources = new BlockingCollection<string>();
Parallel.ForEach(pages, x =>
{
using(var client = new WebClient())
{
var pagesource = client.DownloadString(x);
sources.Add(pagesource);
}
});
Yet another approach, that uses async:
static IEnumerable<string> GetSources(List<string> pages)
{
var sources = new BlockingCollection<string>();
var latch = new CountdownEvent(pages.Count);
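foreach (var p in pages)
{
// Do not wrap the client in a using block here: DownloadStringAsync returns
// immediately, so the client would be disposed before the download finishes.
var wc = new WebClient();
wc.DownloadStringCompleted += (x, e) =>
{
// e.Result throws if the download failed, so check e.Error first.
if (e.Error == null && !e.Cancelled)
sources.Add(e.Result);
wc.Dispose();
latch.Signal();
};
wc.DownloadStringAsync(new Uri(p));
}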
latch.Wait();
return sources;
}
You should use parallel programming for this purpose.
There are a lot of ways to achieve what you want; the easiest would be something like this:
var pageList = new List<string>();
for (int i = 1; i <= pages; i++)
{
pageList.Add(baseurl + "&page=" + i.ToString());
}
// pageList is a list of urls
Parallel.ForEach<string>(pageList, (page) =>
{
try
{
WebClient client = new WebClient();
var pagesource = client.DownloadString(page);
client.Dispose();
lock (sourcelist)
sourcelist.Add(pagesource);
}
catch (Exception) {}
});
I had a similar case, and this is how I solved it:
using System;
using System.Threading;
using System.Collections.Generic;
using System.Net;
using System.IO;
namespace WebClientApp
{
class MainClassApp
{
private static int requests = 0;
private static object requests_lock = new object();
public static void Main() {
List<string> urls = new List<string> { "http://www.google.com", "http://www.slashdot.org"};
foreach(var url in urls) {
ThreadPool.QueueUserWorkItem(GetUrl, url);
}
int cur_req = 0;
while(cur_req<urls.Count) {
lock(requests_lock) {
cur_req = requests;
}
Thread.Sleep(1000);
}
Console.WriteLine("Done");
}
private static void GetUrl(Object the_url) {
string url = (string)the_url;
string html;
// Dispose the client, stream, and reader once the page has been read.
using (WebClient client = new WebClient())
using (Stream data = client.OpenRead(url))
using (StreamReader reader = new StreamReader(data))
{
html = reader.ReadToEnd();
}
// Do something with html
Console.WriteLine(html);
lock(requests_lock) {
//Maybe you could add here the HTML to SourceList
requests++;
}
}
}
}
You should consider using the Parallel class, because the slowness comes from your software waiting for I/O; while one thread is waiting for I/O, another one can get started.
While the other answers are perfectly valid, all of them (at the time of this writing) are neglecting something very important: calls to the web are IO-bound, and having a thread wait on an operation like this is going to strain system resources.
What you really want to do is take advantage of the async methods on the WebClient class (as some have pointed out) as well as the Task Parallel Library's ability to handle the Event-Based Asynchronous Pattern.
First, you would get the urls that you want to download:
IEnumerable<Uri> urls = Enumerable.Range(1, pages).Select(i => new Uri(baseurl +
"&page=" + i.ToString(CultureInfo.InvariantCulture)));
Then, you would create a new WebClient instance for each url, using the TaskCompletionSource<T> class to handle the calls asynchronously (this won't burn a thread):
IEnumerable<Task<Tuple<Uri, string>>> tasks = urls.Select(url => {
// Create the task completion source.
var tcs = new TaskCompletionSource<Tuple<Uri, string>>();
// The web client.
var wc = new WebClient();
// Attach to the DownloadStringCompleted event.
wc.DownloadStringCompleted += (s, e) => {
// Dispose of the client when done.
using (wc)
{
// If there is an error, set it.
if (e.Error != null)
{
tcs.SetException(e.Error);
}
// Otherwise, set cancelled if cancelled.
else if (e.Cancelled)
{
tcs.SetCanceled();
}
else
{
// Set the result.
tcs.SetResult(new Tuple<Uri, string>(url, e.Result));
}
}
};
// Start the process asynchronously, don't burn a thread.
wc.DownloadStringAsync(url);
// Return the task.
return tcs.Task;
});
Now you have an IEnumerable<T> which you can convert to an array and wait on all of the results using Task.WaitAll:
// Materialize the tasks.
Task<Tuple<Uri, string>>[] materializedTasks = tasks.ToArray();
// Wait for all to complete.
Task.WaitAll(materializedTasks);
Then, you can just use the Result property on the Task<T> instances to get the pair of the url and the content:
// Cycle through each of the results.
foreach (Tuple<Uri, string> pair in materializedTasks.Select(t => t.Result))
{
// pair.Item1 will contain the Uri.
// pair.Item2 will contain the content.
}
Note that the above code has no error handling.
If you wanted to get even more throughput, instead of waiting for the entire list to finish, you could process the content of each page as soon as it is done downloading. Task<T> is meant to be used like a pipeline: when one unit of work completes, have it continue to the next one instead of waiting for all of the items to be done (if they can be processed asynchronously).
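One way to process each page as soon as its download finishes is sketched below, using Task.WhenAny in a loop. The CompletionPipeline class and the ProcessAsCompletedAsync method are hypothetical names, and the sketch works with any sequence of Task<T>, such as the download tasks built above. It is one possible reading of the suggestion, not the answer's own code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class CompletionPipeline
{
    // Invokes 'process' on each result in completion order, not submission order.
    public static async Task ProcessAsCompletedAsync<T>(
        IEnumerable<Task<T>> tasks, Action<T> process)
    {
        // Materialize once so the tasks are not re-created on each enumeration.
        var pending = tasks.ToList();

        while (pending.Count > 0)
        {
            // WhenAny completes as soon as any one of the pending tasks finishes.
            Task<T> finished = await Task.WhenAny(pending);
            pending.Remove(finished);

            // Awaiting the finished task rethrows its exception if it failed.
            process(await finished);
        }
    }
}
```

Each loop iteration rescans the pending list, so this is fine for tens or hundreds of tasks but becomes wasteful for many thousands; attaching a continuation to each task individually avoids that cost.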
I am using an active thread count and an arbitrary limit:
private static volatile int activeThreads = 0;
public static void RecordData()
{
var nbThreads = 10;
var source = db.ListOfUrls; // Thousands urls
var iterations = source.Length / groupSize;
for (int i = 0; i < iterations; i++)
{
var subList = source.Skip(groupSize* i).Take(groupSize);
Parallel.ForEach(subList, (item) => RecordUri(item));
//I want to wait here until process further data to avoid overload
while (activeThreads > 30) Thread.Sleep(100);
}
}
private static async Task RecordUri(Uri uri)
{
using (WebClient wc = new WebClient())
{
Interlocked.Increment(ref activeThreads);
wc.DownloadStringCompleted += (sender, e) => Interlocked.Decrement(ref activeThreads);
var jsonData = await wc.DownloadStringTaskAsync(uri);
var root = JsonConvert.DeserializeObject<RootObject>(jsonData);
RecordData(root);
}
}
