web crawling in a multi threaded console app - c#

I am new to multi-threaded apps.
I am trying to create a console app to check a given list of IP addresses (intranet). Each web page for any given IP address contains some stats, displayed in an html table, that I need to collect.
I can do this in a single thread: set up the request/response sequence, get the page content and parse it.
What I am struggling with is making this multi-threaded: I have to deal with 4000 IP addresses, and a single thread would take too long. I have the IPs in a list or array of strings; how can I set up the threads?
Assuming I have a function that processes the response, say, "ProcessResponse(string s)", and want to start with 10 threads, can I start with something like:
public class PASSServer
{
private string _ip;
public string IPAddress
{
get;
set;
}
public PASSServer()
{
}
}
static void Main(string[] args)
{
int iNumThreads = 3;
Thread[] threads = new Thread[iNumThreads];
string[] sIPs = { "192.168.10.20", "192.168.10.21", "192.168.10.22" };
for (int i = 0; i < threads.Length; i++)
{
ParameterizedThreadStart start = new ParameterizedThreadStart(Start);
threads[i] = new Thread(start);
PASSServer pserver = new PASSServer();
pserver.IPAddress = sIPs[i];
threads[i].Start(pserver);
}
Console.WriteLine("DONE");
Console.ReadKey();
}
static void Start(object info)
{
PASSServer pserver = (PASSServer)info;
crawl(pserver.IPAddress);
}
private static void crawl(string sUrl)
{
PASSData cData = new PASSData();
string sRequestUrl = "http://" + sUrl.Trim() + "/cgi-bin/sysstat?";
string sEncodingType = "utf-8";
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(sRequestUrl);
request.KeepAlive = true;
request.Timeout = 15 * 1000;
System.Net.HttpWebResponse response = (HttpWebResponse)request.GetResponse();
string sStatus = ((HttpWebResponse)response).StatusDescription;
sEncodingType = GetEncodingType(response);
System.IO.StreamReader reader = new System.IO.StreamReader(response.GetResponseStream(), Encoding.GetEncoding(sEncodingType));
// Read the content.
string responseFromServer = reader.ReadToEnd();
Console.WriteLine(responseFromServer);
}
Any help is greatly appreciated.
I have not used multi threading but googled the subject and got some ideas just not sure how best to set up my scenario.

Don't use threads. Use asynchronous HTTP requests. For example, use HttpWebRequest.BeginGetResponse or perhaps HttpWebRequest.GetResponseAsync. Limit the number of concurrent requests using a Semaphore.
So, if you have a list of URLs (a List<string>) and you want a maximum of 10 concurrent requests:
List<string> _urls = GetListOfUrls();
Semaphore _requestSemaphore = new Semaphore(10, 10);
foreach (var url in _urls)
{
// wait for an available spot
_requestSemaphore.WaitOne();
// Now start an asynchronous request with this url
var request = (HttpWebRequest)WebRequest.Create(url);
request.BeginGetResponse(GetResponseCallback, request);
}
Once you have started a request for every URL in the list, you have to wait for the final responses to be received. The way you do that is to wait on the semaphore 10 times. When you have acquired all 10 slots, there can't be any outstanding requests:
for (int i = 0; i < 10; ++i)
{
_requestSemaphore.WaitOne();
}
And your callback, which is called when a response is received:
void GetResponseCallback(IAsyncResult ar)
{
    var request = (HttpWebRequest)ar.AsyncState;
    try
    {
        using (var response = (HttpWebResponse)request.EndGetResponse(ar))
        {
            // process the response here.
        }
    }
    catch (WebException ex) { Console.Error.WriteLine(request.RequestUri + ": " + ex.Message); }
    finally
    {
        // release the semaphore even when the request failed, so the loop never stalls
        _requestSemaphore.Release();
    }
}
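On newer .NET versions the same throttling idea can be written with async/await instead of Begin/End callbacks. The following is only a sketch under assumptions: ThrottledCrawler and FetchAllAsync are made-up names, and the fetch delegate stands in for whatever actually downloads and parses a page (for example HttpClient.GetStringAsync).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledCrawler
{
    // Hypothetical helper: calls fetch for every url, with at most
    // maxConcurrency calls in flight at any time. Results keep input order.
    public static async Task<string[]> FetchAllAsync(
        IEnumerable<string> urls, int maxConcurrency, Func<string, Task<string>> fetch)
    {
        using (var gate = new SemaphoreSlim(maxConcurrency, maxConcurrency))
        {
            var tasks = urls.Select(async url =>
            {
                await gate.WaitAsync();      // wait for a free slot without blocking a thread
                try { return await fetch(url); }
                finally { gate.Release(); }  // free the slot even if fetch throws
            }).ToList();

            return await Task.WhenAll(tasks);
        }
    }
}

// Possible usage (assumes a shared HttpClient named http and a list named ips):
//   string[] pages = await ThrottledCrawler.FetchAllAsync(
//       ips.Select(ip => "http://" + ip + "/cgi-bin/sysstat?"), 10,
//       url => http.GetStringAsync(url));
```

If any fetch throws, Task.WhenAll rethrows after all tasks have finished, so you may prefer to catch inside the delegate and return an error marker per URL.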

I would loop through your list of IP addresses and queue a ThreadPool work item for each one.
foreach (string addr in IpAddresses)
{
    // QueueUserWorkItem passes the state as object, so cast it back to string
    ThreadPool.QueueUserWorkItem(state =>
    {
        string ipaddr = (string)state;
        ResponseFromQuery resp = new ResponseFromQuery();
        this.BeginInvoke(new MethodInvoker(() => { UpdateTable(resp); }));
    }, addr);
}
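*EDIT: Above, you will need to call BeginInvoke and create a MethodInvoker that calls back to a new method in your application called UpdateTable. You can pass in your response information (whatever type it is; I used a made-up ResponseFromQuery class as an example). Note that BeginInvoke is a Windows Forms Control method; in a console app there is no UI thread, so you can call UpdateTable directly, with locking if it touches shared state.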
You can use either an anonymous function or, if there is a lot of code and you might use it elsewhere, you could create a processing class and method that you can pass as your method that you want executed.
If you wanted to manage your threads yourself, you can create a Dictionary or List object and add a thread to that for each item in your collection:
Dictionary<string, Thread> _threads = new Dictionary<string, Thread>();
foreach (string addr in IpAddresses)
{
    _threads.Add(addr, new System.Threading.Thread(
        new System.Threading.ParameterizedThreadStart(
            (object ip) =>
            {
                // process (string)ip.
            })));
    // the argument is passed to Start, not to the Thread constructor
    _threads[addr].Start(addr);
}
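If you do manage threads yourself, two details matter for the code in the question: Main has to Join the threads before printing "DONE", and with 4000 addresses you probably want a fixed number of threads that each work through a share of the list rather than one thread per address. A minimal sketch, assuming a synchronous crawl delegate that returns the parsed result as a string (ThreadPartitionDemo and ProcessAll are made-up names):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

public static class ThreadPartitionDemo
{
    // Hypothetical helper: spreads the addresses over threadCount threads.
    // Thread t handles items t, t + threadCount, t + 2 * threadCount, ...
    public static List<string> ProcessAll(
        IReadOnlyList<string> ips, int threadCount, Func<string, string> crawl)
    {
        var results = new ConcurrentBag<string>(); // safe to add to from many threads
        var threads = new List<Thread>();

        for (int t = 0; t < threadCount; t++)
        {
            int offset = t; // copy the loop variable so each closure gets its own value
            var thread = new Thread(() =>
            {
                for (int i = offset; i < ips.Count; i += threadCount)
                    results.Add(crawl(ips[i]));
            });
            threads.Add(thread);
            thread.Start();
        }

        // Block until every worker has finished; only then is it really "DONE".
        foreach (var thread in threads)
            thread.Join();

        return results.ToList(); // order is not guaranteed
    }
}
```

An exception thrown by crawl on a worker thread would terminate the process, so a real version should catch inside the loop.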

Related

Ping with multithreading

I did this:
WebClient client = new WebClient();
string[] dns = client.DownloadString("https://public-dns.info/nameservers.txt")
.Split('\n');
List<string> parsedDns = new List<string>();
foreach (string dnsStr in dns)
{
Ping ping = new Ping();
if (dnsStr.Contains(":"))
{
}
else if (ping.SendPingAsync(dnsStr, 150).Result.RoundtripTime <= 150)
{
parsedDns.Add(dnsStr);
}
}
foreach (var dns_ in parsedDns.ToArray())
{
Console.WriteLine(dns_);
}
Console.ReadKey();
What it does is download a list of DNS servers from a page, put them in a string[], and ping them one by one; those that respond in 150 ms or less are saved and printed to the console. I tried to do it with multiple threads, but it kept giving me errors. I would like to know how to do this with, for example, 500 threads and without bugs, in order to speed up the process.
You could use the Parallel.ForEachAsync API, that was introduced in .NET 6.
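var parsedDns = new ConcurrentQueue<string>();
var options = new ParallelOptions() { MaxDegreeOfParallelism = 10 };
Parallel.ForEachAsync(dns, options, async (dnsStr, ct) =>
{
    if (dnsStr.Contains(':')) return; // skip entries containing ':', as the original loop does
    using Ping ping = new();
    PingReply reply = await ping.SendPingAsync(dnsStr, 150);
    // RoundtripTime is 0 for a failed ping, so check the status as well
    if (reply.Status == IPStatus.Success && reply.RoundtripTime <= 150)
    {
        parsedDns.Enqueue(dnsStr);
    }
}).Wait();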
The Parallel.ForEachAsync method returns a Task that you can either await, or simply Wait as in the above example.

C# - how to do multiple web requests at the same time

I wrote some code to check URLs; however, it works really slowly. I want to make it work on several URLs at the same time, for example 10 URLs, or at least make it as fast as possible.
My code:
Parallel.ForEach(urls, new ParallelOptions {
MaxDegreeOfParallelism = 10
}, s => {
try {
using(HttpRequest httpRequest = new HttpRequest()) {
httpRequest.UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0";
httpRequest.Cookies = new CookieDictionary(false);
httpRequest.ConnectTimeout = 10000;
httpRequest.ReadWriteTimeout = 10000;
httpRequest.KeepAlive = true;
httpRequest.IgnoreProtocolErrors = true;
string check = httpRequest.Get(s + "'", null).ToString();
if (errors.Any(new Func < string, bool > (check.Contains))) {
Valid.Add(s);
Console.WriteLine(s);
File.WriteAllLines(Environment.CurrentDirectory + "/Good.txt", Valid);
}
}
} catch {
}
});
It is unlikely that your service calls are CPU-bound, so spinning up more threads to handle the load is probably not the best approach. You will get better throughput if you use async and await instead, ideally with the more modern HttpClient rather than HttpRequest or HttpWebRequest.
Here is an example of how to do it:
var client = new HttpClient();
//Start with a list of URLs
var urls = new string[]
{
"http://www.google.com",
"http://www.bing.com"
};
//Start requests for all of them
var requests = urls.Select
(
url => client.GetAsync(url)
).ToList();
//Wait for all the requests to finish
await Task.WhenAll(requests);
//Get the responses
var responses = requests.Select
(
task => task.Result
);
foreach (var r in responses)
{
// Extract the message body
var s = await r.Content.ReadAsStringAsync();
Console.WriteLine(s);
}
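Try the following:
Parallel.ForEach(urls, new ParallelOptions { MaxDegreeOfParallelism = Math.Max(1, Environment.ProcessorCount - 1) }, s => { /* same body as in the question */ });
This uses all cores but one, so the rest of the machine stays responsive. Math.Max guards against a single-core machine, where ProcessorCount - 1 would be 0 and ParallelOptions would throw.
Also, consider @KSib's comment.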

Using Multithreading with Async-await C#

I wrote an async function for fetching data from Facebook. It works, but by my understanding it should not work the way it does. Can someone explain this to me?
public class FacebookData
{
static string fb_api_version = ConfigurationManager.AppSettings["fb_ver"];
static string accessToken = ConfigurationManager.AppSettings["accessToken"];
static string fb_id = "";
private HttpClient _httpClient;
public FacebookData(string input_id)
{
fb_id = input_id;
_httpClient = new HttpClient
{
BaseAddress = new Uri("https://graph.facebook.com/" + fb_api_version + "/"),
Timeout = TimeSpan.FromSeconds(15)
};
_httpClient.DefaultRequestHeaders.Accept.Add(new MediaTypeWithQualityHeaderValue("application/json"));
}
public async Task<T> getData<T>()
{
var response = await _httpClient.GetAsync($"{fb_id}?access_token={accessToken}");
if (!response.IsSuccessStatusCode)
return default(T);
var result = await response.Content.ReadAsStringAsync();
return JsonConvert.DeserializeObject<T>(result);
}
}
The calling class is typical, I make it await for the response.
But the problem is where I call it.
In main
static void Main(string[] args)
{
string[] data_Set = [//ids_group]
for (int i = 0; i < data_Set.length; ++i){
Console.WriteLine("Running Thread " + (i+1).ToString());
var dataSet = facebookRequestCmd(data_Set[i]);
writeToTXT(dataSet);
Console.WriteLine("Finished Thread " + (i + 1).ToString());
//do sth
}
}
In facebookRequestCmd
static Dictionary<string, string[]> facebookRequestCmd(string ids){
Dictionary<string, string[]> allData = new Dictionary<string, string[]>();
string[] ids_arr = ids.split(",")
for (var i = 0; i < ids.length; i++){
var facebook_client = new FacebookData(sqlData);
var response = facebook_client.getData<dynamic>();
Task.WaitAll(response);
//then use the result to do sth
}
}
In my understanding, each call to getData returns to the calling thread as soon as it starts awaiting the data, so the Task does not really start a new thread.
Thus async/await should help with waiting for the HTTP request, but there should be no multithreading.
However,
Console.WriteLine("Running Thread " + (i+1).ToString());
prints for all iterations almost simultaneously, as if I had really started a thread per iteration of the for loop in Main.
Why? And is this the right way to use multithreading with async/await? I want to make multiple calls at the same time.
Originally I used Parallel.ForEach to kick off the calls; however, that is not asynchronous and blocks the thread.
OK, feel free to ignore all the changes I've made, but I couldn't help modifying how some of the variables read and how the code looks. This is not a working application and I have not tested it. It is just a cleaned-up version of what you have, with a suggested way of using Task. It is also mocked using only the code you provided. #2 is, I believe, the answer you needed.
1. In Main I removed the word 'thread', since that's not actually what's happening. It may be, but we don't know whether HttpClient is indeed starting a new thread or just holding on to and returning from the REST call. Using async/await does not always mean a Thread was started (although it's common to think of it that way).
2. I used .Result (not Wait(), as I suggested in the comments) to get the result of the task. This is OK since it's a console app, but it is not ideal for a real-world app that needs to operate without blocking. I also removed Task.WaitAll with this change.
3. I renamed functions to contain verbs because, IMO, functions should be doing work and the name should describe the work being done.
4. I renamed some variables because, IMO, variables should be PascalCase when their scope isn't a method or private, and camelCase when it is. The name should also, IMO, say what the thing is, followed by the type that makes sense.
5. I appended 'Async' to the names of functions that return a running Task.
6. Changed FacebookClient to be a singleton that uses a single HttpClient instead of many and allows it to be disposed; plus more.
7. Added an alternate version of the GetFacebookData function that starts all the tasks and waits for them together.
static void Main(string[] args)
{
string[] dataSet = new string[] { /* mocked */ }; // [ids_group]; <- no idea what this is so I mocked it.
for (int i = 0; i < dataSet.Length; i++)
{
Console.WriteLine("Main... " + (i + 1).ToString());
var result = GetFacebookData(dataSet[i]);
WriteToTxt(result);
Console.WriteLine("Complete... " + (i + 1).ToString());
//do sth
}
Console.Read();
}
private static Dictionary<string, string[]> GetFacebookData(string idsString)
{
var allDataDictionary = new Dictionary<string, string[]>();
var idsArray = idsString.Split(',');
foreach (var id in idsArray)
{
var response = FacebookClient.Instance.GetDataAsync<string[]>(id).Result;
allDataDictionary.Add(id, response);
}
return allDataDictionary;
}
public class FacebookClient
{
private readonly HttpClient httpClient;
private readonly string facebookApiVersion;
private readonly string accessToken;
public static FacebookClient Instance { get; } = new FacebookClient();
FacebookClient()
{
facebookApiVersion = ConfigurationManager.AppSettings["fb_ver"];
accessToken = ConfigurationManager.AppSettings["accessToken"];
httpClient = new HttpClient
{
BaseAddress = new Uri("https://graph.facebook.com/" + facebookApiVersion + "/"),
Timeout = TimeSpan.FromSeconds(15)
};
httpClient.DefaultRequestHeaders.Accept.Add(new MediaTypeWithQualityHeaderValue("application/json"));
}
public async Task<T> GetDataAsync<T>(string facebookId)
{
var response = await httpClient.GetAsync($"{facebookId}?access_token={accessToken}");
if (!response.IsSuccessStatusCode) return default;
var result = await response.Content.ReadAsStringAsync();
return JsonConvert.DeserializeObject<T>(result);
}
~FacebookClient() => httpClient.Dispose();
}
Here's a version that's starting all the tasks and then awaiting them all at the same time. I believe this might give you some issues on the HttpClient but we'll see.
private static Dictionary<string, string[]> GetFacebookData(string idsString)
{
var allDataDictionary = new Dictionary<string, string[]>();
var idsArray = idsString.Split(',');
var getDataTasks = new List<Task<string[]>>();
foreach (var id in idsArray)
{
getDataTasks.Add(FacebookClient.Instance.GetDataAsync<string[]>(id));
}
var tasksArray = getDataTasks.ToArray();
Task.WaitAll(tasksArray);
var resultsArray = tasksArray.Select(task => task.Result).ToArray();
for (var i = 0; i < idsArray.Length; i++)
{
allDataDictionary.Add(idsArray[i], resultsArray[i]);
}
return allDataDictionary;
}
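To illustrate the point made in this answer that async/await does not imply one thread per operation, here is a small sketch that starts many asynchronous waits and then awaits them together. Task.Delay stands in for an HTTP call, and the names AsyncOverlapDemo and RunAsync are made up for the example.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

public static class AsyncOverlapDemo
{
    // Starts `operations` asynchronous waits of delayMs each and returns the
    // total elapsed time once all of them have completed.
    public static async Task<TimeSpan> RunAsync(int operations, int delayMs)
    {
        var stopwatch = Stopwatch.StartNew();

        // All delays are started here, before anything is awaited.
        Task[] tasks = Enumerable.Range(0, operations)
                                 .Select(_ => Task.Delay(delayMs))
                                 .ToArray();

        await Task.WhenAll(tasks); // no thread is blocked while the delays run
        return stopwatch.Elapsed;
    }
}
```

With 100 operations of 200 ms each, the total should be close to one delay rather than the 20 seconds that running them one after another would take, which is the same effect that makes the "Running..." lines in the question appear together.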

DateTime difference inside Thread

I have a for loop that creates a new Thread on each iteration. In short, my loop creates 20 threads that each perform some action at the same time.
My goal inside each of these threads is to create a DateTime variable with a start time, execute an operation, and create a DateTime variable with an end time. I then take the difference between the two to find out how long the operation took in this SPECIFIC thread, and log it.
However, that isn't working as expected, and I'm confused about why.
It seems as if it just "adds" the time to the variables with each new thread, instead of creating a completely fresh variable that only counts for that specific thread.
This is my for-loop code:
for(int i = 0; i < 20; i++)
{
Thread thread = new Thread(() =>
{
Stopwatch sw = new Stopwatch();
sw.Start();
RESTRequest(Method.POST, ....);
sw.Stop();
Console.WriteLine("Result took (" + sw.Elapsed.Seconds + " seconds, " + sw.Elapsed.Milliseconds + " milliseconds)");
});
thread.IsBackground = true;
thread.Start();
}
Long operation function:
public static string RESTRequest(Method method, string endpoint, string resource, string body, SimplytureRESTRequestHeader[] requestHeaders = null, SimplytureRESTResponseHeader[] responseHeaders = null, SimplytureRESTAuthentication authentication = null, SimplytureRESTParameter[] parameters = null)
{
var client = new RestClient(endpoint);
if(authentication != null)
{
client.Authenticator = new HttpBasicAuthenticator(authentication.username, authentication.password);
}
var request = new RestRequest(resource, method);
if (requestHeaders != null)
{
foreach (var header in requestHeaders)
{
request.AddHeader(header.headerType, header.headerValue);
}
}
if(body != null)
{
request.AddParameter("text/json", body, ParameterType.RequestBody);
}
if(parameters != null)
{
foreach (var parameter in parameters)
{
request.AddParameter(parameter.key, parameter.value);
}
}
IRestResponse response = client.Execute(request);
if (responseHeaders != null)
{
foreach (var header in responseHeaders)
{
var par = new Parameter();
par.Name = header.headerType;
par.Value = header.headerValue;
response.Headers.Add(par);
}
}
var content = response.Content;
return content;
}
These are my results:
EDIT:
I also tried using the Stopwatch class. It didn't make any difference, although it is definitely handier. I also added the long operation for debugging.
There is a limit on concurrent connections to the same ServicePoint.
On .NET Framework the default is 2 concurrent connections for each unique ServicePoint (host). With 20 threads calling the same endpoint, 18 requests wait in a queue, and each measured time includes that wait, which is why the timings appear to add up.
Add System.Net.ServicePointManager.DefaultConnectionLimit = 20; before the first request to raise that limit to match the thread count.
You can also set this value in the config file:
<system.net>
<connectionManagement>
<add address="*" maxconnection="20" />
</connectionManagement>
</system.net>

Processing large number of tasks concurrently and asynchronously

I would like to process a list of 50,000 URLs through a web service. The provider of this service allows 5 connections per second.
I need to process these URLs in parallel while adhering to the provider's rules.
This is my current code:
static void Main(string[] args)
{
process_urls().GetAwaiter().GetResult();
}
public static async Task process_urls()
{
// let's say there is a list of 50,000+ URLs
var urls = System.IO.File.ReadAllLines("urls.txt");
var allTasks = new List<Task>();
var throttler = new SemaphoreSlim(initialCount: 5);
foreach (var url in urls)
{
await throttler.WaitAsync();
allTasks.Add(
Task.Run(async () =>
{
try
{
Console.WriteLine(String.Format("Starting {0}", url));
var client = new HttpClient();
var xml = await client.GetStringAsync(url);
//do some processing on xml output
client.Dispose();
}
finally
{
throttler.Release();
}
}));
}
await Task.WhenAll(allTasks);
}
Instead of var client = new HttpClient(); I will create an object of the target web service; HttpClient is only used here to keep the code generic.
Is this the correct approach for processing a huge list of connections? And is there any way to limit the number of connections established per second to 5, given that the current implementation does not take any time frame into account?
Thanks
Reading values from a web service is an IO operation, which can be done asynchronously without multithreading.
In this case the threads would do nothing but wait for the response, so using Parallel just wastes resources.
public static async Task process_urls()
{
    var urls = System.IO.File.ReadAllLines("urls.txt");
    foreach (var urlGroup in SplitToGroupsOfFive(urls))
    {
        var tasks = new List<Task>();
        foreach (var url in urlGroup)
        {
            var task = ProcessUrl(url);
            tasks.Add(task);
        }
        // This delay ensures that the next 5 urls are started no sooner than 1 second later
        tasks.Add(Task.Delay(1000));
        await Task.WhenAll(tasks.ToArray());
    }
}
// static, so that it can be called from the static process_urls method
private static async Task ProcessUrl(string url)
{
    using (var client = new HttpClient())
    {
        var xml = await client.GetStringAsync(url);
        //do some processing on xml output
    }
}
private static IEnumerable<IEnumerable<string>> SplitToGroupsOfFive(IEnumerable<string> urls)
{
    const int GROUP_SIZE = 5;
    string[] group = null;
    int count = 0;
    foreach (var url in urls)
    {
        if (group == null)
            group = new string[GROUP_SIZE];
        group[count] = url;
        count++;
        if (count < GROUP_SIZE)
            continue;
        yield return group;
        group = null;
        count = 0;
    }
    // return the final, partially filled group without its empty slots
    if (group != null)
    {
        yield return group.Take(count);
    }
}
Because you mention that the "processing" of the response is also an IO operation, the async/await approach is the most efficient: no thread is blocked while a request or a file write is in flight, so other tasks can make progress while earlier ones are waiting on the web service or on file IO.
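On .NET 6 or later, the hand-written grouping can be replaced by Enumerable.Chunk. The sketch below shows the same "at most N starts per interval" batching under that assumption; BatchRunner, RunInBatchesAsync and the processUrl delegate are made-up names, and processUrl stands in for the call to the real web service.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class BatchRunner
{
    // Starts at most batchSize calls per batch and does not begin the next
    // batch until the current calls are done AND minBatchTime has passed.
    public static async Task RunInBatchesAsync(
        IEnumerable<string> urls, int batchSize, TimeSpan minBatchTime,
        Func<string, Task> processUrl)
    {
        foreach (string[] batch in urls.Chunk(batchSize)) // Chunk requires .NET 6+
        {
            List<Task> tasks = batch.Select(processUrl).ToList();
            tasks.Add(Task.Delay(minBatchTime)); // enforces the minimum batch duration
            await Task.WhenAll(tasks);
        }
    }
}

// Possible usage for "5 connections per second":
//   await BatchRunner.RunInBatchesAsync(urls, 5, TimeSpan.FromSeconds(1), ProcessUrl);
```

Like the code in the answer, this waits for the slowest call in each batch, so the real rate can fall well below 5 per second when responses are slow.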
