I did this:
WebClient client = new WebClient();
string[] dns = client.DownloadString("https://public-dns.info/nameservers.txt")
.Split('\n');
List<string> parsedDns = new List<string>();
foreach (string dnsStr in dns)
{
Ping ping = new Ping();
if (dnsStr.Contains(":"))
{
}
else if (ping.SendPingAsync(dnsStr, 150).Result.RoundtripTime <= 150)
{
parsedDns.Add(dnsStr);
}
}
foreach (var dns_ in parsedDns.ToArray())
{
Console.WriteLine(dns_);
}
Console.ReadKey();
That what it does is collect the DNS of a page, put them in a string[] and then ping them one by one and those with less than 150ms of response are saved and printed on the console. I tried to do it with multithreads but it kept giving me errors and I would like to know how it would be to do this with for example 500 threads without any bugs in order to increase the speed of this process.
You could use the Parallel.ForEachAsync API, that was introduced in .NET 6.
var parsedDns = new ConcurrentQueue<string>();
var options = new ParallelOptions() { MaxDegreeOfParallelism = 10 };
Parallel.ForEachAsync(dns, options, async (dnsStr, ct) =>
{
Ping ping = new();
PingReply reply = await ping.SendPingAsync(dnsStr, 150);
if (reply.RoundtripTime <= 150)
{
parsedDns.Enqueue(dnsStr);
}
}).Wait();
The Parallel.ForEachAsync method returns a Task that you can either await, or simply Wait as in the above example.
Related
I wrote a web crawler and I want to know if my approach is correct. The only issue I'm facing is that it stops after some hours of crawling. No exception, it just stops.
1 - the private members and the constructor:
private const int CONCURRENT_CONNECTIONS = 5;
private readonly HttpClient _client;
private readonly string[] _services = new string[2] {
"https://example.com/items?id=ID_HERE",
"https://another_example.com/items?id=ID_HERE"
}
private readonly List<SemaphoreSlim> _semaphores;
public Crawler() {
ServicePointManager.DefaultConnectionLimit = CONCURRENT_CONNECTIONS;
_client = new HttpClient();
_semaphores = new List<SemaphoreSlim>();
foreach (var _ in _services) {
_semaphores.Add(new SemaphoreSlim(CONCURRENT_CONNECTIONS));
}
}
Single HttpClient instance.
The _services is just a string array that contains the URL, they are not the same domain.
I'm using semaphores (one per domain) since I read that it's not a good idea to use the network queue (I don't remember how it calls).
2 - The Run method, which is the one I will call to start crawling.
public async Run(List<int> ids) {
const int BATCH_COUNT = 1000;
var svcIndex = 0;
var tasks = new List<Task<string>>(BATCH_COUNT);
foreach (var itemId in ids) {
tasks.Add(DownloadItem(svcIndex, _services[svcIndex].Replace("ID_HERE", $"{itemId}")));
if (++svcIndex >= _services.Length) {
svcIndex = 0;
}
if (tasks.Count >= BATCH_COUNT) {
var results = await Task.WhenAll(tasks);
await SaveDownloadedData(results);
tasks.Clear();
}
}
if (tasks.Count > 0) {
var results = await Task.WhenAll(tasks);
await SaveDownloadedData(results);
tasks.Clear();
}
}
DownloadItem is an async function that actually makes the GET request, note that I'm not awaiting it here.
If the number of tasks reaches the BATCH_COUNT, I will await all to complete and save the results to file.
3 - The DownloadItem function.
private async Task<string> DownloadItem(int serviceIndex, string link) {
var needReleaseSemaphore = true;
var result = string.Empty;
try {
await _semaphores[serviceIndex].WaitAsync();
var r = await _client.GetStringAsync(link);
_semaphores[serviceIndex].Release();
needReleaseSemaphore = false;
// DUE TO JSON SIZE, I NEED TO REMOVE A VALUE (IT'S USELESS FOR ME)
var obj = JObject.Parse(r);
if (obj.ContainsKey("blah")) {
obj.Remove("blah");
}
result = obj.ToString(Formatting.None);
} catch {
result = string.Empty;
// SINCE I GOT AN EXCEPTION, I WILL 'LOCK' THIS SERVICE FOR 1 MINUTE.
// IF I RELEASED THIS SEMAPHORE, I WILL LOCK IT AGAIN FIRST.
if (!needReleaseSemaphore) {
await _semaphores[serviceIndex].WaitAsync();
needReleaseSemaphore = true;
}
await Task.Delay(60_000);
} finally {
// RELEASE THE SEMAPHORE, IF NEEDED.
if (needReleaseSemaphore) {
_semaphores[serviceIndex].Release();
}
}
return result;
}
4- The function that saves the result.
private async Task SaveDownloadedData(List<string> myData) {
using var fs = new FileStream("./output.dat", FileMode.Append);
foreach (var res in myData) {
var blob = Encoding.UTF8.GetBytes(res);
await fs.WriteAsync(BitConverter.GetBytes((uint)blob.Length));
await fs.WriteAsync(blob);
}
await fs.DisposeAsync();
}
5- Finally, the Main function.
static async Task Main(string[] args) {
var crawler = new Crawler();
var items = LoadItemIds();
await crawler.Run(items);
}
After all this, is my approach correct? I need to make millions of requests, will take some weeks/months to gather all data I need (due to the connection limit).
After 12 - 14 hours, it just stops and I need to manually restart the app (memory usage is ok, my VPS has 1 GB and it never used more than 60%).
I am doing some performance tests on ZeroMQ in order to compare it with others like RabbitMQ and ActiveMQ.
In my broadcast tests and to avoid "The Dynamic Discovery Problem" as referred by ZeroMQ documentation I have used a proxy. In my scenario, I am using 50 concurrent publishers each one sending 500 messages with 1ms delay between sends. Each message is then read by 50 subscribers. And as I said I am losing messages, each of the subscribers should receive a total of 25000 messages and they are each receiving between 5000 and 10000 messages only.
I am using Windows and C# .Net client clrzmq4 (4.1.0.31).
I have already tried some solutions that I found on other posts:
I have set linger to TimeSpan.MaxValue
I have set ReceiveHighWatermark to 0 (as it is presented as infinite, but I have tried also Int32.MaxValue)
I have set checked for slow start receivers, I made receivers start some seconds before publishers
I had to make sure that no garbage collection is made to the socket instances (linger should do it but to make sure)
I have a similar scenario (with similar logic) using NetMQ and it works fine. The other scenario does not use security though and this one does (and that's also the reason why I use clrzmq in this one because I need client authentication with certificates that is not yet possible on NetMQ).
EDIT:
public class MCVEPublisher
{
public void publish(int numberOfMessages)
{
string topic = "TopicA";
ZContext ZContext = ZContext.Create();
ZSocket publisher = new ZSocket(ZContext, ZSocketType.PUB);
//Security
// Create or load certificates
ZCert serverCert = Main.GetOrCreateCert("publisher");
var actor = new ZActor(ZContext, ZAuth.Action, null);
actor.Start();
// send CURVE settings to ZAuth
actor.Frontend.Send(new ZFrame("VERBOSE"));
actor.Frontend.Send(new ZMessage(new List<ZFrame>()
{ new ZFrame("ALLOW"), new ZFrame("127.0.0.1") }));
actor.Frontend.Send(new ZMessage(new List<ZFrame>()
{ new ZFrame("CURVE"), new ZFrame(".curve") }));
publisher.CurvePublicKey = serverCert.PublicKey;
publisher.CurveSecretKey = serverCert.SecretKey;
publisher.CurveServer = true;
publisher.Linger = TimeSpan.MaxValue;
publisher.ReceiveHighWatermark = Int32.MaxValue;
publisher.Connect("tcp://127.0.0.1:5678");
Thread.Sleep(3500);
for (int i = 0; i < numberOfMessages; i++)
{
Thread.Sleep(1);
var update = $"{topic} {"message"}";
using (var updateFrame = new ZFrame(update))
{
publisher.Send(updateFrame);
}
}
//just to make sure it does not end instantly
Thread.Sleep(60000);
//just to make sure publisher is not garbage collected
ulong Affinity = publisher.Affinity;
}
}
public class MCVESubscriber
{
private ZSocket subscriber;
private List<string> prints = new List<string>();
public void read()
{
string topic = "TopicA";
var context = new ZContext();
subscriber = new ZSocket(context, ZSocketType.SUB);
//Security
ZCert serverCert = Main.GetOrCreateCert("xpub");
ZCert clientCert = Main.GetOrCreateCert("subscriber");
subscriber.CurvePublicKey = clientCert.PublicKey;
subscriber.CurveSecretKey = clientCert.SecretKey;
subscriber.CurveServer = true;
subscriber.CurveServerKey = serverCert.PublicKey;
subscriber.Linger = TimeSpan.MaxValue;
subscriber.ReceiveHighWatermark = Int32.MaxValue;
// Connect
subscriber.Connect("tcp://127.0.0.1:1234");
subscriber.Subscribe(topic);
while (true)
{
using (var replyFrame = subscriber.ReceiveFrame())
{
string messageReceived = replyFrame.ReadString();
messageReceived = Convert.ToString(messageReceived.Split(' ')[1]);
prints.Add(messageReceived);
}
}
}
public void PrintMessages()
{
Console.WriteLine("printing " + prints.Count);
}
}
public class Main
{
static void Main(string[] args)
{
broadcast(500, 50, 50, 30000);
}
public static void broadcast(int numberOfMessages, int numberOfPublishers, int numberOfSubscribers, int timeOfRun)
{
new Thread(() =>
{
using (var context = new ZContext())
using (var xsubSocket = new ZSocket(context, ZSocketType.XSUB))
using (var xpubSocket = new ZSocket(context, ZSocketType.XPUB))
{
//Security
ZCert serverCert = GetOrCreateCert("publisher");
ZCert clientCert = GetOrCreateCert("xsub");
xsubSocket.CurvePublicKey = clientCert.PublicKey;
xsubSocket.CurveSecretKey = clientCert.SecretKey;
xsubSocket.CurveServer = true;
xsubSocket.CurveServerKey = serverCert.PublicKey;
xsubSocket.Linger = TimeSpan.MaxValue;
xsubSocket.ReceiveHighWatermark = Int32.MaxValue;
xsubSocket.Bind("tcp://*:5678");
//Security
serverCert = GetOrCreateCert("xpub");
var actor = new ZActor(ZAuth.Action0, null);
actor.Start();
// send CURVE settings to ZAuth
actor.Frontend.Send(new ZFrame("VERBOSE"));
actor.Frontend.Send(new ZMessage(new List<ZFrame>()
{ new ZFrame("ALLOW"), new ZFrame("127.0.0.1") }));
actor.Frontend.Send(new ZMessage(new List<ZFrame>()
{ new ZFrame("CURVE"), new ZFrame(".curve") }));
xpubSocket.CurvePublicKey = serverCert.PublicKey;
xpubSocket.CurveSecretKey = serverCert.SecretKey;
xpubSocket.CurveServer = true;
xpubSocket.Linger = TimeSpan.MaxValue;
xpubSocket.ReceiveHighWatermark = Int32.MaxValue;
xpubSocket.Bind("tcp://*:1234");
using (var subscription = ZFrame.Create(1))
{
subscription.Write(new byte[] { 0x1 }, 0, 1);
xpubSocket.Send(subscription);
}
Console.WriteLine("Intermediary started, and waiting for messages");
// proxy messages between frontend / backend
ZContext.Proxy(xsubSocket, xpubSocket);
Console.WriteLine("end of proxy");
//just to make sure it does not end instantly
Thread.Sleep(60000);
//just to make sure xpubSocket and xsubSocket are not garbage collected
ulong Affinity = xpubSocket.Affinity;
int ReceiveHighWatermark = xsubSocket.ReceiveHighWatermark;
}
}).Start();
Thread.Sleep(5000); //to make sure proxy started
List<MCVESubscriber> Subscribers = new List<MCVESubscriber>();
for (int i = 0; i < numberOfSubscribers; i++)
{
MCVESubscriber ZeroMqSubscriber = new MCVESubscriber();
new Thread(() =>
{
ZeroMqSubscriber.read();
}).Start();
Subscribers.Add(ZeroMqSubscriber);
}
Thread.Sleep(10000);//to make sure all subscribers started
for (int i = 0; i < numberOfPublishers; i++)
{
MCVEPublisher ZeroMqPublisherBroadcast = new MCVEPublisher();
new Thread(() =>
{
ZeroMqPublisherBroadcast.publish(numberOfMessages);
}).Start();
}
Thread.Sleep(timeOfRun);
foreach (MCVESubscriber Subscriber in Subscribers)
{
Subscriber.PrintMessages();
}
}
public static ZCert GetOrCreateCert(string name, string curvpath = ".curve")
{
ZCert cert;
string keyfile = Path.Combine(curvpath, name + ".pub");
if (!File.Exists(keyfile))
{
cert = new ZCert();
Directory.CreateDirectory(curvpath);
cert.SetMeta("name", name);
cert.Save(keyfile);
}
else
{
cert = ZCert.Load(keyfile);
}
return cert;
}
}
This code also produces the expected number of messages when security is disabled, but when turned on it doesn't.
Does someone know another thing to check? Or has it happened to anyone before?
Thanks
I would like to process a list of 50,000 urls through a web service, The provider of this service allows 5 connections per second.
I need to process these urls in parallel with adherence to provider's rules.
This is my current code:
static void Main(string[] args)
{
process_urls().GetAwaiter().GetResult();
}
public static async Task process_urls()
{
// let's say there is a list of 50,000+ URLs
var urls = System.IO.File.ReadAllLines("urls.txt");
var allTasks = new List<Task>();
var throttler = new SemaphoreSlim(initialCount: 5);
foreach (var url in urls)
{
await throttler.WaitAsync();
allTasks.Add(
Task.Run(async () =>
{
try
{
Console.WriteLine(String.Format("Starting {0}", url));
var client = new HttpClient();
var xml = await client.GetStringAsync(url);
//do some processing on xml output
client.Dispose();
}
finally
{
throttler.Release();
}
}));
}
await Task.WhenAll(allTasks);
}
Instead of var client = new HttpClient(); I will create a new object of the target web service but this is just to make the code generic.
Is this the correct approach to handle and process a huge list of connections? and is there anyway I can limit the number of established connections per second to 5 as the current implementation will not consider any timeframe?
Thanks
Reading values from web service is IO operation which can be done asynchronously without multithreading.
Threads do nothing - only waiting for response in this case. So using parallel is just wasting of resources.
public static async Task process_urls()
{
var urls = System.IO.File.ReadAllLines("urls.txt");
var allTasks = new List<Task>();
var throttler = new SemaphoreSlim(initialCount: 5);
foreach (var urlGroup in SplitToGroupsOfFive(urls))
{
var tasks = new List<Task>();
foreach(var url in urlGroup)
{
var task = ProcessUrl(url);
tasks.Add(task);
}
// This delay will sure that next 5 urls will be used only after 1 seconds
tasks.Add(Task.Delay(1000));
await Task.WhenAll(tasks.ToArray());
}
}
private async Task ProcessUrl(string url)
{
using (var client = new HttpClient())
{
var xml = await client.GetStringAsync(url);
//do some processing on xml output
}
}
private IEnumerable<IEnumerable<string>> SplitToGroupsOfFive(IEnumerable<string> urls)
{
var const GROUP_SIZE = 5;
var string[] group = null;
var int count = 0;
foreach (var url in urls)
{
if (group == null)
group = new string[GROUP_SIZE];
group[count] = url;
count++;
if (count < GROUP_SIZE)
continue;
yield return group;
group = null;
count = 0;
}
if (group != null && group.Length > 0)
{
yield return group.Take(group.Length);
}
}
Because you mention that "processing" of response is also IO operation, then async/await approach is most efficient, because it using only one thread and process other tasks when previous tasks waiting for response from web service or from file writing IO operations.
New to multi-threded apps.
I am trying to create a console app to check a given list of IP addresses (intranet). Each web page for any given IP address contains some stats, displayed in an html table, that I need to collect.
I can do this in a single thread: set up the request/response sequence, get the page content and parse it.
What I am struggling with right now is to make this multi-threaded since I have to deal with 4000 IP addresses and single thread would take some time. I have the list of IPs in a list or array of strings; do you know how I can set up the threads?
Assuming I have a function that processes the response, say, "ProcessResponse(string s)", and want to start with 10 threads, can I start with something like:
public class PASSServer
{
private string _ip;
public string IPAddress
{
get;
set;
}
public PASSServer()
{
}
}
static void Main(string[] args)
{
int iNumThreads = 3;
Thread[] threads = new Thread[iNumThreads];
string[] sIPs = { "192.168.10.20", "192.168.10.21", "192.168.10.22" };
for (int i = 0; i < threads.Length; i++)
{
ParameterizedThreadStart start = new ParameterizedThreadStart(Start);
threads[i] = new Thread(start);
PASSServer pserver = new PASSServer();
pserver.IPAddress = sIPs[i];
threads[i].Start(pserver);
}
Console.WriteLine("DONE");
Console.ReadKey();
}
static void Start(object info)
{
PASSServer pserver = (PASSServer)info;
crawl(pserver.IPAddress);
}
private static void crawl(string sUrl)
{
PASSData cData = new PASSData();
string sRequestUrl = "http://" + sUrl.Trim() + "/cgi-bin/sysstat?";
string sEncodingType = "utf-8";
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(sRequestUrl);
request.KeepAlive = true;
request.Timeout = 15 * 1000;
System.Net.HttpWebResponse response = (HttpWebResponse)request.GetResponse();
string sStatus = ((HttpWebResponse)response).StatusDescription;
sEncodingType = GetEncodingType(response);
System.IO.StreamReader reader = new System.IO.StreamReader(response.GetResponseStream(), Encoding.GetEncoding(sEncodingType));
// Read the content.
string responseFromServer = reader.ReadToEnd();
Console.WriteLine(responseFromServer);
}
Any help is greatly appreciated.
I have not used multi threading but googled the subject and got some ideas just not sure how best to set up my scenario.
Don't use threads. Use asynchronous HTTP requests. For example, use HttpWebRequest.BeginGetResponse or perhaps HttpWebRequest.GetResponseAsync. Limit the number of concurrent requests using a Semaphore.
So, if you have a list of URLs (a List<string>) and you want a maximum of 10 concurrent requests:
List<string> _urls = GetListOfUrls();
Semaphore _requestSemaphore = new Semaphore(10, 10);
foreach (var url in _urls)
{
// wait for an available spot
_requestSemaphore.WaitOne();
// Now start an asynchronous request with this url
var request = (HttpWebRequest)WebRequest.Create(url);
request.BeginGetResponse(GetResponseCallback, request);
}
When your list is empty, you have to wait for the final responses to be received. The way you do that is to wait on the semaphore 10 times. When you've got 10, then there can't be any outstanding requests:
for (int i = 0; i < 10; ++i)
{
_requestSemaphore.WaitOne();
}
And your callback, which is called when a response is received:
void GetResponseCallback(IAsyncResult ar)
{
var request = (HttpWebRequest)ar.AsyncState;
var response = (HttpWebResponse)request.EndGetResponse(ar);
// process the response here.
// when you're done processing the response, release the semaphore
_requestSemaphore.Release();
}
I would loop through your list of IP addresses and start a ThreadPool work item.
foreach(string addr in IpAddresses)
Threading.ThreadPool.QueueUserWorkItem(
(string ipaddr) =>
{
ResponseFromQuery resp = new ResponseFromQuery();
this.BeginInvoke(new MethodInvoker(() => { UpdateTable(resp); }));
}, addr);
*EDIT: Above, you will need to call BeginInvoke and create a methodinvoker that calls back to a new method in your application call UpdateTable. You can pass in your response information (whatever type it is, I used a made up ResponseFromQuery class for example).
You can use either an anonymous function or, if there is a lot of code and you might use it elsewhere, you could create a processing class and method that you can pass as your method that you want executed.
If you wanted to manage your threads yourself, you can create a Dictionary or List object and add a thread to that for each item in your collection:
Dictionary<string, Thread> _threads = new Dictionary<string, Thread>();
foreach (string addr in IpAddresses)
{
_threads.Add(addr, new System.Threading.Thread(
new System.Threading.ParameterizedThreadStart(
(object ip) =>
{
// process ip.
}, addr)));
_threads[addr].Start();
}
I'm using the following code to post an image to a server.
var image= Image.FromFile(#"C:\Image.jpg");
Task<string> upload = Upload(image);
upload.Wait();
public static async Task<string> Upload(Image image)
{
var uriBuilder = new UriBuilder
{
Host = "somewhere.net",
Path = "/path/",
Port = 443,
Scheme = "https",
Query = "process=false"
};
using (var client = new HttpClient())
{
client.DefaultRequestHeaders.Add("locale", "en_US");
client.DefaultRequestHeaders.Add("country", "US");
var content = ConvertToHttpContent(image);
content.Headers.ContentType = MediaTypeHeaderValue.Parse("image/jpeg");
using (var mpcontent = new MultipartFormDataContent("--myFakeDividerText--")
{
{content, "fakeImage", "myFakeImageName.jpg"}
}
)
{
using (
var message = await client.PostAsync(uriBuilder.Uri, mpcontent))
{
var input = await message.Content.ReadAsStringAsync();
return "nothing for now";
}
}
}
}
I'd like to modify this code to run multiple threads. I've used "ThreadPool.QueueUserWorkItem" before and started to modify the code to leverage it.
private void UseThreadPool()
{
int minWorker, minIOC;
ThreadPool.GetMinThreads(out minWorker, out minIOC);
ThreadPool.SetMinThreads(1, minIOC);
int maxWorker, maxIOC;
ThreadPool.GetMaxThreads(out maxWorker, out maxIOC);
ThreadPool.SetMinThreads(4, maxIOC);
var events = new List<ManualResetEvent>();
foreach (var image in ImageCollection)
{
var resetEvent = new ManualResetEvent(false);
ThreadPool.QueueUserWorkItem(
arg =>
{
var img = Image.FromFile(image.getPath());
Task<string> upload = Upload(img);
upload.Wait();
resetEvent.Set();
});
events.Add(resetEvent);
if (events.Count <= 0) continue;
foreach (ManualResetEvent e in events) e.WaitOne();
}
}
The problem is that only one thread executes at a time due to the call to "upload.Wait()". So I'm still executing each thread in sequence. It's not clear to me how I can use PostAsync with a thread-pool.
How can I post images to a server using multiple threads by tweaking the code above? Is HttpClient PostAsync the best way to do this?
I'd like to modify this code to run multiple threads.
Why? The thread pool should only be used for CPU-bound work (and I/O completions, of course).
You can do concurrency just fine with async:
var tasks = ImageCollection.Select(image =>
{
var img = Image.FromFile(image.getPath());
return Upload(img);
});
await Task.WhenAll(tasks);
Note that I removed your Wait. You should avoid using Wait or Result with async tasks; use await instead. Yes, this will cause async to grow through you code, and you should use async "all the way".