I have a List of URLs. I want to download each one via WebClient.DownloadStringAsync. The problem I encounter is: how do I know which e.Result corresponds to which URL?
public class ressource
{
    public string url { get; set; }
    public string result { get; set; }
}

List<ressource> urlist = new List<ressource>();
urlist.Add(new ressource() { url = "blabla", result = string.Empty });
// ...etc

var wc = new WebClient();
foreach (var item in urlist)
{
    wc.DownloadStringCompleted += new DownloadStringCompletedEventHandler(wc_DownloadStringCompleted);
    wc.DownloadStringAsync(new Uri(item.url, UriKind.Absolute));
}

void wc_DownloadStringCompleted(object sender, DownloadStringCompletedEventArgs e)
{
    urlist[?].result = e.Result;
}
I feel completely stuck.
Thanks for your ideas.
the problem I encounter is: how do I know which e.Result corresponds to what url ?
There are various different options for this:
UserState
You can pass in a second argument to DownloadStringAsync, which is then available via DownloadStringCompletedEventArgs.UserState. For example:
// In your loop....
var wc = new WebClient();
wc.DownloadStringAsync(new Uri(item.url, UriKind.Absolute), item);
void wc_DownloadStringCompleted(object sender, DownloadStringCompletedEventArgs e)
{
    var item = (ressource) e.UserState;
    item.result = e.Result;
}
Multiple WebClients
You can create a new WebClient for each iteration of the loop, and attach a different event handler to it. A lambda expression is useful here:
// Note: this is broken in C# 3 and 4 due to the capture semantics of foreach.
// It should be fine in C# 5 though.
foreach (var item in urlist)
{
    var wc = new WebClient();
    wc.DownloadStringCompleted += (sender, args) => item.result = args.Result;
    wc.DownloadStringAsync(new Uri(item.url, UriKind.Absolute));
}
DownloadStringTaskAsync
You could use DownloadStringTaskAsync instead, so that each call returns a Task<string>. You could keep a collection of these, one for each element in urlist, and know which is which that way.
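A minimal sketch of that idea, assuming .NET 4.5. The DownloadAllAsync name and the injected fetch delegate are my own invention, used so the pairing logic can be shown (and exercised) without touching the network:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Threading.Tasks;

public class ressource   // as in the question
{
    public string url { get; set; }
    public string result { get; set; }
}

public static class Downloader
{
    // Each task is created from its own item, so the url/result pairing
    // is never lost, no matter in which order the downloads finish.
    public static async Task DownloadAllAsync(
        List<ressource> urlist, Func<string, Task<string>> fetch)
    {
        var tasks = urlist.Select(async item =>
        {
            item.result = await fetch(item.url);
        }).ToArray();

        await Task.WhenAll(tasks);   // every item.result is now filled in
    }
}
```

Real usage would pass one WebClient per request, e.g. `url => new WebClient().DownloadStringTaskAsync(new Uri(url, UriKind.Absolute))`, since a single WebClient cannot run concurrent requests.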
Alternatively, you could just fetch all the results synchronously, but I suspect you don't want to do that.
Additional information
Unfortunately, WebClient doesn't support multiple concurrent connections, so with all the options above you should create a new WebClient per iteration anyway.
Another alternative, and the one I prefer, is to use Microsoft's Reactive Framework (Rx). It handles all the background threading for you, similar to the TPL, but often easier.
Here's how I would do it:
var query =
    from x in urlist.ToObservable()
    from result in Observable.Using(
        () => new WebClient(),
        wc => Observable.Start(() => wc.DownloadString(x.url)))
    select new
    {
        x.url,
        result
    };
Now to get the results back into the original urlist.
var lookup = urlist.ToDictionary(x => x.url);

query.Subscribe(x =>
{
    lookup[x.url].result = x.result;
});
Simple as that.
Related
This is from a Microsoft site. I don't understand the += (o, a) part. What is that?
private void GetResponse(Uri uri, Action<Response> callback)
{
    WebClient wc = new WebClient();
    wc.OpenReadCompleted += (o, a) =>
    {
        if (callback != null)
        {
            DataContractJsonSerializer ser = new DataContractJsonSerializer(typeof(Response));
            callback(ser.ReadObject(a.Result) as Response);
        }
    };
    wc.OpenReadAsync(uri);
}
wc.OpenReadCompleted += (o, a) => { }
This is assigning an anonymous delegate to the wc.OpenReadCompleted event. The (o, a) part is the method parameter list: o is the sender object, and a is the event args.
As you can see from the signature of OpenReadCompletedEventHandler (the delegate type used to subscribe to the OpenReadCompleted event), o is the sender and a is an instance of OpenReadCompletedEventArgs.
In general, this style of event subscription is just instantiating a delegate from a lambda expression; it has been possible since C# 3.0.
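As a self-contained illustration (the Downloader class and CompletedHandler delegate here are hypothetical stand-ins, not WebClient), both subscription styles attach the same delegate type:

```csharp
using System;
using System.Collections.Generic;

delegate void CompletedHandler(object sender, string args);

class Downloader
{
    public event CompletedHandler Completed;

    public void Finish(string data)
    {
        var handler = Completed;
        if (handler != null) handler(this, data);
    }
}

class Demo
{
    public static readonly List<string> Log = new List<string>();

    public static void Main()
    {
        var d = new Downloader();

        // Lambda: o and a are simply the delegate's two parameters
        // (the sender and the event args), named however you like.
        d.Completed += (o, a) => Log.Add("lambda:" + a);

        // Equivalent subscription with a named method:
        d.Completed += OnCompleted;

        d.Finish("payload");   // both handlers run
        foreach (var line in Log) Console.WriteLine(line);
    }

    static void OnCompleted(object sender, string args)
    {
        Log.Add("method:" + args);
    }
}
```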
A 3rd party has supplied an interface which allows me to search their database for customers and retrieve their details, e.g. pictures, date of birth, etc.
I imported their WSDL into Visual Studio and am using the Async methods to retrieve the customer details.
MyClient Client = new MyClient();
Client.FindCustomersCompleted += FindCustomersCompleted;
Client.GetCustomerDetailsCompleted += GetCustomerDetailsCompleted;
Client.FindCustomersAsync("Jones");
Below are the two events which deal with the responses.
void FindCustomersCompleted(object sender, FindCustomersCompletedEventArgs e)
{
    foreach (var Cust in e.Customers)
    {
        Client.GetCustomerDetailsAsync(Cust.ID);
    }
}

void GetCustomerDetailsCompleted(object sender, GetCustomerDetailsCompletedEventArgs e)
{
    // Add the customer details to the result box on the Window.
}
So let's assume that my initial search for "Jones" returns no results or causes an error. It's fairly straightforward to tell the user that there was an error or that no results were found, as I only receive a single response.
However, if I get, say, 50 results for "Jones", then I make 50 GetCustomerDetailsAsync calls and get 50 responses.
Let's say that something goes wrong on the server side and I don't get any valid responses. Each GetCustomerDetailsCompleted event will receive an error/timeout, and I can determine that that individual response has failed.
What is the best way to determine that all of my responses have failed, so that I need to inform the user that there has been a failure?
Alternatively, what if 1 out of 50 succeeds?
Should I keep track of my requests and flag them as successful as I receive the responses?
Should I keep track of my requests and flag them as successful as I
receive the responses?
This is also how I manage multiple requests: flag whether each returned result is without fault, track the flags, and after each return evaluate whether you have already processed all of them. I do not know of another way.
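A sketch of that bookkeeping (the ResponseTracker name is my own invention): an Interlocked-based counter that is safe to call from concurrently arriving Completed handlers, and that tells you when the last response is in:

```csharp
using System.Threading;

// Hypothetical helper: count successes across N expected callbacks.
public class ResponseTracker
{
    private readonly int expected;
    private int completed;
    private int succeeded;

    public ResponseTracker(int expected)
    {
        this.expected = expected;
    }

    // Call once from each Completed handler.
    // Returns true when all expected responses have arrived.
    public bool Report(bool success)
    {
        if (success) Interlocked.Increment(ref succeeded);
        return Interlocked.Increment(ref completed) == expected;
    }

    public int Succeeded
    {
        get { return Thread.VolatileRead(ref succeeded); }
    }

    public bool AllFailed
    {
        get { return Succeeded == 0; }
    }
}
```

In GetCustomerDetailsCompleted you would call tracker.Report(e.Error == null); when it returns true, all responses are in, and you can inspect Succeeded / AllFailed (on the UI thread) to decide what to tell the user.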
I would start by converting the Event-based Asynchronous Pattern model to a Task-based one. This would let you use the built-in async/await keywords, resulting in much easier to use code.
Here is a simple implementation: https://stackoverflow.com/a/15316668/3070052
In your case I would not update the UI on each event, but gather all the information in a single variable and display it only once I have all the results.
Here is some code to get you going:
public class CustomerDetails
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public class FindCustomersResult
{
    public FindCustomersResult()
    {
        CustomerDetails = new List<CustomerDetails>();
    }

    public List<CustomerDetails> CustomerDetails { get; set; }
}
public class ApiWrapper
{
    public Task<FindCustomersResult> FindCustomers(string customerName)
    {
        var tcs = new TaskCompletionSource<FindCustomersResult>();
        var client = new MyClient();
        client.FindCustomersCompleted += async (sender, e) =>
        {
            var result = new FindCustomersResult();
            foreach (var customer in e.Customers)
            {
                var customerDetails = await GetCustomerDetails(customer.ID);
                result.CustomerDetails.Add(customerDetails);
            }
            tcs.SetResult(result);
        };
        client.FindCustomersAsync(customerName);
        return tcs.Task;
    }

    public Task<CustomerDetails> GetCustomerDetails(int customerId)
    {
        var tcs = new TaskCompletionSource<CustomerDetails>();
        var client = new MyClient();
        client.GetCustomerDetailsCompleted += (sender, e) =>
        {
            var result = new CustomerDetails();
            result.Id = customerId;
            result.Name = e.Name;
            tcs.SetResult(result);
        };
        client.GetCustomerDetailsAsync(customerId);
        return tcs.Task;
    }
}
Then you call this by:
var api = new ApiWrapper();
var findCustomersResult = await api.FindCustomers("Jones");
This would fail if any request fails.
PS. I wrote this example in Notepad, so bear with me if it does not compile or contains syntax errors. :)
I am developing a Windows Phone application and I am stuck at one part.
My project is in C#/XAML, VS2013.
Problem:
I have a ListPicker (Name: UserPicker) which lists all users' names. Now I want to get the UserID from the database for the selected UserName. I have implemented a Web API and I am using JSON for deserialization.
But I am not able to return the string from the DownloadStringCompleted event.
Code:
string usid = "";
selecteduser = (string)UserPicker.SelectedItem;
string uri = "http://localhost:1361/api/user";
WebClient client = new WebClient();
client.Headers["Accept"] = "application/json";
client.DownloadStringAsync(new Uri(uri));
//client.DownloadStringCompleted += client_DownloadStringCompleted;
client.DownloadStringCompleted += (s1, e1) =>
{
    //var data = JsonConvert.DeserializeObject<Chore[]>(e1.Result.ToString());
    //MessageBox.Show(data.ToString());
    var user = JsonConvert.DeserializeObject<User[]>(e1.Result.ToString());
    foreach (User u in user)
    {
        if (u.UName == selecteduser)
        {
            usid = u.UserID;
        }
        //result.Add(c);
        return usid;
    }
    //return usid
};
I want to return the UserID of the selected user, but it's giving me the following errors:
Since 'System.Net.DownloadStringCompletedEventHandler' returns void, a return keyword must not be followed by an object expression
Cannot convert lambda expression to delegate type 'System.Net.DownloadStringCompletedEventHandler' because some of the return types in the block are not implicitly convertible to the delegate return type
If you check the source code of DownloadStringCompletedEventHandler you will see that it is declared like this:
public delegate void DownloadStringCompletedEventHandler(
    object sender, DownloadStringCompletedEventArgs e);
That means that you can't return any data from it. You probably have some method that does something with the selected user id; you will need to call that method from the event handler. So if this method is named HandleSelectedUserId, the code might look like this:
client.DownloadStringCompleted += (sender, e) =>
{
    string selectedUserId = null;
    var users = JsonConvert.DeserializeObject<User[]>(e.Result.ToString());
    foreach (User user in users)
    {
        if (user.UName == selecteduser)
        {
            selectedUserId = user.UserID;
            break;
        }
    }
    HandleSelectedUserId(selectedUserId);
};
client.DownloadStringAsync(new Uri("http://some.url"));
It's also a good idea to add the event handler for the DownloadStringCompleted event before you call the DownloadStringAsync method.
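If you would rather not thread a callback through, another option (assuming .NET 4.0+, with Json.NET as in the question) is to wrap the event in a TaskCompletionSource so callers can await the result. A sketch:

```csharp
using System;
using System.Net;
using System.Threading.Tasks;

public static class WebClientExtensions
{
    // Wraps the event-based call in a Task<string>.
    public static Task<string> DownloadStringTask(this WebClient client, Uri uri)
    {
        var tcs = new TaskCompletionSource<string>();
        client.DownloadStringCompleted += (sender, e) =>
        {
            if (e.Error != null) tcs.TrySetException(e.Error);
            else if (e.Cancelled) tcs.TrySetCanceled();
            else tcs.TrySetResult(e.Result);
        };
        // Handler is attached before the download starts.
        client.DownloadStringAsync(uri);
        return tcs.Task;
    }
}
```

Hypothetical usage in the question's scenario (assumes the question's User type):
`string json = await client.DownloadStringTask(new Uri(uri));` followed by the same JsonConvert.DeserializeObject<User[]> lookup, with the found UserID now simply the method's return value.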
I really hope there's someone experienced with both the TPL and the System.Net classes and methods.
What started as a simple thought of using the TPL on a currently sequential set of actions led me to a halt in my project.
As I am still fresh with .NET, I am jumping straight into deep water with the TPL...
I was trying to extract an ASPX page's source/content (HTML) using WebClient.
Having multiple requests per day (around 20-30 pages to go through) and extracting specific values out of the source code, being only one of a few daily tasks the server has on its list,
led me to try implementing it with the TPL and thus gain some speed.
Although I tried using Task.Factory.StartNew(), iterating over a few WebClient instances,
on the first execution the application just does not get any result from the WebClient.
This is my last try at it:
static void Main(string[] args)
{
    EnumForEach<Act>(Execute);
    Task.WaitAll();
}

public static void EnumForEach<Mode>(Action<Mode> Exec)
{
    foreach (Mode mode in Enum.GetValues(typeof(Mode)))
    {
        Mode Curr = mode;
        Task.Factory.StartNew(() => Exec(Curr));
    }
}

string ResultsDirectory = Environment.CurrentDirectory,
       URL = "",
       TempSourceDocExcracted = "",
       ResultFile = "";

enum Act
{
    dolar, ValidateTimeOut
}

void Execute(Act Exc)
{
    switch (Exc)
    {
        case Act.dolar:
            URL = "http://www.AnyDomainHere.Com";
            ResultFile = ResultsDirectory + "\\TempHtm.htm";
            TempSourceDocExcracted = IeNgn.AgilityPacDocExtraction(URL).GetElementbyId("Dv_Main").InnerHtml;
            File.WriteAllText(ResultFile, TempSourceDocExcracted);
            break;
        case Act.ValidateTimeOut:
            URL = "http://www.AnotherDomainHere.Com";
            ResultFile += "\\TempHtm.htm";
            TempSourceDocExcracted = IeNgn.AgilityPacDocExtraction(URL).GetElementbyId("Dv_Main").InnerHtml;
            File.WriteAllText(ResultFile, TempSourceDocExcracted);
            break;
    }
}

//usage of HtmlAgilityPack to extract values of elements by their attributes/properties
public HtmlAgilityPack.HtmlDocument AgilityPacDocExtraction(string URL)
{
    using (WC = new WebClient())
    {
        WC.Proxy = null;
        WC.Encoding = Encoding.GetEncoding("UTF-8");
        tmpExtractedPageValue = WC.DownloadString(URL);
        retAglPacHtmDoc.LoadHtml(tmpExtractedPageValue);
        return retAglPacHtmDoc;
    }
}
What am I doing wrong? Is it possible to use WebClient with the TPL at all, or should I use another tool (I am not able to use IIS 7 / .NET 4.5)?
I see at least several issues:
Naming: FlNm is not a name. Visual Studio is a modern IDE with smart code completion; there's no need to save keystrokes. (You may start here; there are alternatives too. The main thing is to keep it consistent: see the C# Coding Conventions.)
If you're using multithreading, you need to care about resource sharing. For example, FlNm is a static string and is assigned inside each thread, so its value is not deterministic. (Even if this ran sequentially the code would be faulty: you would be appending the file name to the path on each iteration, producing something like c:\TempHtm.htm\TempHtm.htm\TempHtm.htm.)
You're writing to the same file from different threads (at least that was your intent, I think); that's usually a recipe for disaster in multithreading. The question is whether you need to write anything to disk at all, or whether the page can be downloaded as a string and parsed without touching the disk; there's a good example of what it means to touch a disk.
Overall, I think you should parallelize only the downloading and keep HtmlAgilityPack out of the multithreaded part, since you don't know whether it is thread-safe. Downloading has a good performance/thread-count ratio; HTML parsing does not, except perhaps with a thread count equal to the core count, but no more. Even better, I would separate downloading and parsing, as that would be easier to test, understand, and maintain.
Update: I don't understand your full intent, but this may help get you started (it's not production code; you should add retries, error catching, etc.).
Also, at the end there is an extended WebClient class allowing you to get more threads spinning, because by default WebClient allows only two connections.
class Program
{
    static void Main(string[] args)
    {
        var urlList = new List<string>
        {
            "http://google.com",
            "http://yahoo.com",
            "http://bing.com",
            "http://ask.com"
        };

        var htmlDictionary = new ConcurrentDictionary<string, string>();

        Parallel.ForEach(
            urlList,
            new ParallelOptions { MaxDegreeOfParallelism = 20 },
            url => Download(url, htmlDictionary));

        foreach (var pair in htmlDictionary)
        {
            Process(pair);
        }
    }

    private static void Process(KeyValuePair<string, string> pair)
    {
        // do the html processing
    }

    private static void Download(string url, ConcurrentDictionary<string, string> htmlDictionary)
    {
        using (var webClient = new SmartWebClient())
        {
            htmlDictionary.TryAdd(url, webClient.DownloadString(url));
        }
    }
}

public class SmartWebClient : WebClient
{
    private readonly int maxConcurentConnectionCount;

    public SmartWebClient(int maxConcurentConnectionCount = 20)
    {
        this.maxConcurentConnectionCount = maxConcurentConnectionCount;
    }

    protected override WebRequest GetWebRequest(Uri address)
    {
        var httpWebRequest = (HttpWebRequest)base.GetWebRequest(address);
        if (httpWebRequest == null)
        {
            return null;
        }

        if (maxConcurentConnectionCount != 0)
        {
            httpWebRequest.ServicePoint.ConnectionLimit = maxConcurentConnectionCount;
        }

        return httpWebRequest;
    }
}
I am working on an app that searches for email addresses in Google search results' URLs. The problem is it needs to return the value it found in each page + the URL in which it found the email, to a datagridview with 2 columns: Email and URL.
I am using Parallel.ForEach for this one but of course it returns random URLs and not the ones it really found the email on.
public static string htmlcon; //html source
public static List<string> emailList = new List<string>();

public static string Get(string url, bool proxy)
{
    htmlcon = "";
    try
    {
        HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
        if (proxy)
            req.Proxy = new WebProxy(proxyIP + ":" + proxyPort);
        req.Method = "GET";
        req.UserAgent = Settings1.Default.UserAgent;
        if (Settings1.Default.EnableCookies == true)
        {
            CookieContainer cont = new CookieContainer();
            req.CookieContainer = cont;
        }
        WebResponse resp = req.GetResponse();
        StreamReader SR = new StreamReader(resp.GetResponseStream());
        htmlcon = SR.ReadToEnd();
        Thread.Sleep(400);
        resp.Close();
        SR.Close();
    }
    catch (Exception)
    {
        Thread.Sleep(500);
    }
    return htmlcon;
}

private void copyMails(string url)
{
    string emailPat = @"(\b[a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b)";
    MatchCollection mailcol = Regex.Matches(htmlcon, emailPat, RegexOptions.Singleline);
    foreach (Match mailMatch in mailcol)
    {
        email = mailMatch.Groups[1].Value;
        if (!emailList.Contains(email))
        {
            emailList.Add(email);
            Action dgeins = () => mailDataGrid.Rows.Insert(0, email, url);
            mailDataGrid.BeginInvoke(dgeins);
        }
    }
}

private void SEbgWorker_DoWork(object sender, DoWorkEventArgs e)
{
    // A LOT OF IRRELEVANT STUFF BEING RUN
    Parallel.ForEach(allSElist.OfType<string>(), (s) =>
    {
        // Get URL
        Get(s, Settings1.Default.Proxyset);
        // match mails, 1st page
        copyMails(s);
    });
}
So this is it: I execute a Get request (where s is the URL from the list) and then execute copyMails(s) on the URL's HTML source, which uses a regex to copy the emails.
If I do it without Parallel it returns the correct URL for each email in the DataGridView. How can I do this in parallel and still get the correct match in the DataGridView?
Thanks
You would be better off using PLINQ's Where to filter (pseudo code):
var results = from i in input.AsParallel()
              let u = get the URL from i
              let d = get the data from u
              let v = try get the value from d
              where v is found
              select new {
                  Url = u,
                  Value = v
              };
Underneath the AsParallel means that TPL's implementation of LINQ operators (Select, Where, ...) is used.
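Filled in with stub helpers (all the names below are placeholders for the real download and extraction steps) so the shape of that query is concrete and runnable:

```csharp
using System;
using System.Linq;

public static class PlinqSketch
{
    // Stubs standing in for the real "get URL / get data / try get value" steps.
    public static string GetUrl(string input)  { return "http://example/" + input; }
    public static string GetData(string url)   { return "<html>" + url + "</html>"; }
    public static string TryGetValue(string d) { return d.Contains("example") ? "hit" : null; }

    public static int Run(string[] input)
    {
        var results = from i in input.AsParallel()
                      let u = GetUrl(i)
                      let d = GetData(u)
                      let v = TryGetValue(d)
                      where v != null            // "where v is found"
                      select new { Url = u, Value = v };

        // The Url/Value pairing survives parallel execution because both
        // come from the same query row; no shared mutable state is involved.
        return results.Count();
    }
}
```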
UPDATE: Now with more information
First there are a number of issues in your code:
The variable htmlcon is static but used directly by multiple threads. This could well be your underlying problem. Consider just two input values: the first Get completes, setting htmlcon; before that thread's call to copyMails starts, the second thread's Get completes and overwrites htmlcon. Both copyMails calls then scan the second page's HTML, so the emails get recorded against the wrong URL.
The list emailList is also accessed without locking by multiple threads. Most collection types in .NET (and any other programming platform) are not thread safe, you need to limit access to a single thread at a time.
You are mixing up various activities in each of your methods. Consider applying the single responsibility principle.
Thread.Sleep to handle an exception?! If you can't handle an exception (i.e. resolve the condition), then do nothing. In this case, if the action throws then the Parallel.ForEach will throw: that will do until you define how to handle the HTML GET failing.
Three suggestions:
In my experience clean code (to an obsessive degree) makes things easier: the details of the format don't matter (one true brace style is better, but consistency is the key). Just going through and cleaning up the formatting showed up issues #1 and #2.
Good naming. Don't abbreviate anything used over more than a few lines of code unless it is a significant term for the domain. E.g. s for the action parameter in the parallel loop is really a url, so call it that. This kind of thing immediately makes the code easier to follow.
Think about that regex for emails: there are many valid emails that will not match (e.g. use of + to provide multiple logical addresses: example+one@gmail.com will be delivered to example@gmail.com and can then be used for local rules). Also an apostrophe (') is a valid character (I have known people frustrated by web sites that refuse their addresses by getting this wrong).
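A quick check makes the point. The "wider" pattern below is only a slightly less restrictive sketch of my own (still nowhere near full RFC 5322), adding + and ' to the local part:

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmailRegexDemo
{
    // The pattern from the question.
    public static readonly Regex Original =
        new Regex(@"(\b[a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b)");

    // Assumption: only demonstrating the + and ' cases, not a complete pattern.
    public static readonly Regex Wider =
        new Regex(@"([a-zA-Z0-9._%+'-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})");

    public static void Main()
    {
        var addr = "example+one@gmail.com";

        // The original pattern silently truncates the local part at '+',
        // matching only the tail of the address:
        Console.WriteLine(Original.Match(addr).Value); // one@gmail.com
        Console.WriteLine(Wider.Match(addr).Value);    // example+one@gmail.com
    }
}
```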
Second: A relatively direct clean up:
public static string Get(string url, bool proxy) {
    HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
    if (proxy) {
        req.Proxy = new WebProxy(proxyIP + ":" + proxyPort);
    }
    req.Method = "GET";
    req.UserAgent = Settings1.Default.UserAgent;
    if (Settings1.Default.EnableCookies == true) {
        CookieContainer cont = new CookieContainer();
        req.CookieContainer = cont;
    }
    using (WebResponse resp = req.GetResponse())
    using (StreamReader SR = new StreamReader(resp.GetResponseStream())) {
        return SR.ReadToEnd();
    }
}
private static Regex emailMatcher = new Regex(@"(\b[a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b)", RegexOptions.Singleline);

private static string[] ExtractEmails(string htmlContent) {
    return emailMatcher.Matches(htmlContent)
                       .OfType<Match>()
                       .Select(m => m.Groups[1].Value)
                       .Distinct()
                       .ToArray();
}
private void SEbgWorker_DoWork(object sender, DoWorkEventArgs e) {
    Parallel.ForEach(allSElist.OfType<string>(), url => {
        var htmlContent = Get(url, Settings1.Default.Proxyset);
        var emails = ExtractEmails(htmlContent);
        foreach (var email in emails) {
            Action dgeins = () => mailDataGrid.Rows.Insert(0, email, url);
            mailDataGrid.BeginInvoke(dgeins);
        }
    });
}
Here I have:
Made use of using statements to automate the cleanup of resources.
Eliminated all mutable shared state.
Regex is explicitly documented to have thread safe instance methods. So I only need a single instance.
Removed noise: no need to pass the URL to ExtractEmails because the extraction doesn't use the URL.
Get now only performs the HTTP GET; ExtractEmails does just the extraction.
Third: The above will block threads on the slowest operation: the HTML GET.
The real concurrency benefit would be to replace HttpWebRequest.GetResponse and reading the response stream with their asynchronous equivalents.
Using Task would be the answer in .NET 4, but you would need to work directly with Stream and the encoding yourself, because StreamReader doesn't provide any BeginABC/EndABC method pairs. But .NET 4.5 is almost here, so apply some async/await:
ExtractEmails needs no changes.
Get is now asynchronous, blocking in neither the HTTP GET nor reading the result.
SEbgWorker_DoWork uses Tasks directly to avoid mixing too many different ways of working with the TPL. Since Get returns a Task<string>, the extraction and UI update can simply be chained on with ContinueWith; passing TaskContinuationOptions.OnlyOnRanToCompletion makes the continuation run only when the download succeeded:
This should work in .NET 4.5, but without a set of valid URLs for which this will work I cannot test.
public static async Task<string> Get(string url, bool proxy) {
    HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
    if (proxy) {
        req.Proxy = new WebProxy(proxyIP + ":" + proxyPort);
    }
    req.Method = "GET";
    req.UserAgent = Settings1.Default.UserAgent;
    if (Settings1.Default.EnableCookies == true) {
        CookieContainer cont = new CookieContainer();
        req.CookieContainer = cont;
    }
    using (WebResponse resp = await req.GetResponseAsync())
    using (StreamReader SR = new StreamReader(resp.GetResponseStream())) {
        return await SR.ReadToEndAsync();
    }
}
private static Regex emailMatcher = new Regex(@"(\b[a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b)", RegexOptions.Singleline);

private static string[] ExtractEmails(string htmlContent) {
    return emailMatcher.Matches(htmlContent)
                       .OfType<Match>()
                       .Select(m => m.Groups[1].Value)
                       .Distinct()
                       .ToArray();
}
private void SEbgWorker_DoWork(object sender, DoWorkEventArgs e) {
    var tasks = allSElist.OfType<string>()
        .Select(url => Get(url, Settings1.Default.Proxyset)
            .ContinueWith(htmlContentTask => {
                // OnlyOnRanToCompletion below: this continuation runs only
                // when the download succeeded, so Result will not throw.
                var htmlContent = htmlContentTask.Result;
                var emails = ExtractEmails(htmlContent);
                foreach (var email in emails) {
                    // No InvokeAsync on WinForms, so do this the old way.
                    Action dgeins = () => mailDataGrid.Rows.Insert(0, email, url);
                    mailDataGrid.BeginInvoke(dgeins);
                }
            }, TaskContinuationOptions.OnlyOnRanToCompletion))
        .ToArray();
    Task.WaitAll(tasks);
}
public static string htmlcon; //html source
public static List<string> emailList = new List<string>();
The problem is that the members htmlcon and emailList are resources shared among threads and among iterations. Each iteration in Parallel.ForEach executes in parallel; that's why you see the strange behaviour.
How to solve the problem:
Modify your code and try to implement it without static variables or shared state.
One option is to change from Parallel.ForEach to TPL task chaining; then the result of one parallel operation becomes the input of the next. This is one of many ways to modify the code to avoid shared state.
Use locking or concurrent collections. Your htmlcon variable could be made volatile, but for the list you should use locks or concurrent collections.
The better way is to modify your code to avoid shared state altogether; how to do that depends on your implementation, and task chaining is only one option.
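To make the first option concrete, here is a sketch with no shared mutable state except one concurrent collection (Fetch and Extract are stubs of my own, standing in for the real HTTP download and regex extraction):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public static class NoSharedState
{
    // Stubs standing in for the real HTTP download and regex extraction.
    public static string Fetch(string url)
    {
        return "contact: admin@" + url + ".com";
    }

    public static string Extract(string html)
    {
        return html.Substring(html.IndexOf(' ') + 1);
    }

    public static ConcurrentBag<Tuple<string, string>> Run(string[] urls)
    {
        // (email, url) pairs; ConcurrentBag is safe for concurrent Add calls.
        var found = new ConcurrentBag<Tuple<string, string>>();

        Parallel.ForEach(urls, url =>
        {
            // html is a local variable, not a static field, so iterations
            // cannot overwrite each other's data; the email/url pairing
            // is built inside the same iteration and stays correct.
            string html = Fetch(url);
            found.Add(Tuple.Create(Extract(html), url));
        });

        return found;
    }
}
```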