I am working on an app that searches for email addresses in Google search results' URLs. The problem is it needs to return the value it found in each page + the URL in which it found the email, to a datagridview with 2 columns: Email and URL.
I am using Parallel.ForEach for this one but of course it returns random URLs and not the ones it really found the email on.
public static string htmlcon; //htmlsource
public static List<string> emailList = new List<string>();
public static string Get(string url, bool proxy)
{
htmlcon = "";
try
{
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
if (proxy)
req.Proxy = new WebProxy(proxyIP + ":" + proxyPort);
req.Method = "GET";
req.UserAgent = Settings1.Default.UserAgent;
if (Settings1.Default.EnableCookies == true)
{
CookieContainer cont = new CookieContainer();
req.CookieContainer = cont;
}
WebResponse resp = req.GetResponse();
StreamReader SR = new StreamReader(resp.GetResponseStream());
htmlcon = SR.ReadToEnd();
Thread.Sleep(400);
resp.Close();
SR.Close();
}
catch (Exception)
{
Thread.Sleep(500);
}
return htmlcon;
}
private void copyMails(string url)
{
string emailPat = @"(\b[a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b)";
MatchCollection mailcol = Regex.Matches(htmlcon, emailPat, RegexOptions.Singleline);
foreach (Match mailMatch in mailcol)
{
email = mailMatch.Groups[1].Value;
if (!emailList.Contains(email))
{
emailList.Add(email);
Action dgeins = () => mailDataGrid.Rows.Insert(0, email, url);
mailDataGrid.BeginInvoke(dgeins);
}
}
}
private void SEbgWorker_DoWork(object sender, DoWorkEventArgs e)
{
//A LOT OF IRRELEVANT STUFF BEING RUN
Parallel.ForEach(allSElist.OfType<string>(), (s) =>
{
//Get URL
Get(s, Settings1.Default.Proxyset);
//match mails 1st page
copyMails(s);
});
}
So this is it: I execute a Get request (where "s" is the URL from the list) and then execute copyMails(s) on the URL's HTML source. It uses a regex to copy the emails.
If I do it without Parallel, it returns the correct URL for each email in the datagridview. How can I do this in parallel and still get the correct match in the datagridview?
Thanks
You would be better off using PLINQ's Where to filter (pseudo code):
var results = from i in input.AsParallel()
let u = get the URL from i
let d = get the data from u
let v = try get the value from d
where v is found
select new {
Url = u,
Value = v
};
Under the hood, AsParallel means that TPL's implementation of the LINQ operators (Select, Where, ...) is used.
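As a rough, runnable version of that pseudo code, here is a minimal sketch. The FakePages dictionary is a hypothetical stand-in for the real HTTP GET (so the example needs no network), and the names PlinqSketch and FindEmails are made up for illustration. The point is only that the URL travels through the query together with the HTML, so each email stays paired with the page it came from.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class PlinqSketch
{
    // Hypothetical stand-in for the real HTTP GET: maps a URL to canned HTML.
    private static readonly Dictionary<string, string> FakePages = new Dictionary<string, string>
    {
        { "http://a.example", "contact: alice@example.com" },
        { "http://b.example", "no address here" },
        { "http://c.example", "bob@example.org and carol@example.net" }
    };

    // Regex instance methods are thread safe, so one shared instance is enough.
    private static readonly Regex EmailMatcher =
        new Regex(@"\b[a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b");

    // Returns (email, url) pairs; no state is shared between the parallel workers.
    public static List<KeyValuePair<string, string>> FindEmails(IEnumerable<string> urls)
    {
        return urls.AsParallel()
            .Select(url => new { Url = url, Html = FakePages[url] })
            .SelectMany(page => EmailMatcher.Matches(page.Html).Cast<Match>()
                .Select(m => new KeyValuePair<string, string>(m.Value, page.Url)))
            .ToList();
    }
}
```

PLINQ does not promise any particular result order here; add AsOrdered() after AsParallel() if the grid should follow the input order.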
UPDATE: Now with more information
First there are a number of issues in your code:
The variable htmlcon is static but used directly by multiple threads. This could well be your underlying problem. Consider just two input values: the first thread's Get completes and sets htmlcon, but before that thread's call to copyMails starts, the second thread's Get completes its HTML GET and overwrites htmlcon. The first thread's copyMails then scans the second page's HTML while reporting the first URL. The same applies to email, which copyMails assigns without declaring locally, so it too appears to be shared.
The list emailList is also accessed without locking by multiple threads. Most collection types in .NET (and any other programming platform) are not thread safe, so you need to limit access to a single thread at a time.
You are mixing up various activities in each of your methods. Consider applying the single responsibility principle.
Thread.Sleep to handle an exception?! If you can't handle an exception (i.e. resolve the condition) then do nothing. In this case, if the action throws then Parallel.ForEach will throw: that'll do until you define how to handle the HTML GET failing.
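For the second issue, one possible shape is a concurrent collection in place of the plain List. This is only a sketch under assumptions: the class name SeenEmails and the demo method are hypothetical, and ConcurrentDictionary is used as a set because TryAdd atomically reports whether the key was new.

```csharp
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public static class SeenEmails
{
    // Used as a concurrent set; the byte value is only a placeholder.
    private static readonly ConcurrentDictionary<string, byte> Seen =
        new ConcurrentDictionary<string, byte>();

    // True only for the first caller that adds a given address, even under contention.
    public static bool TryRegister(string email)
    {
        return Seen.TryAdd(email, 0);
    }

    // Hypothetical demo: 1000 parallel registrations of only 10 distinct addresses.
    public static int CountFirstRegistrations()
    {
        int firsts = 0;
        Parallel.For(0, 1000, i =>
        {
            if (TryRegister("user" + (i % 10) + "@example.com"))
            {
                Interlocked.Increment(ref firsts);
            }
        });
        return firsts;
    }
}
```

Unlike the Contains-then-Add pair on a List, the check and the insert happen in one atomic step, so two threads cannot both decide that the same address is new.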
Three suggestions:
In my experience clean code (to an obsessive degree) makes things easier: the details of the format
don't matter (one true brace style is better, but consistency is the key). Just going through
and cleaning up the formatting showed up issues #1 and #2.
Good naming. Don't abbreviate anything used over more than a few lines of code unless that is a
significant term for the domain. Eg. s for the action parameter in the parallel loop is really a url
so call it that. This kind of thing immediately makes the code easier to follow.
Think about that regex for emails: there are many valid emails that will not match (e.g. use of + to provide multiple logical addresses: example+one@gmail.com will be delivered to example@gmail.com and can then be used for local rules). Also an apostrophe ("'") is a valid character (and I have known people frustrated by web sites that refused their addresses by getting this wrong).
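To illustrate that last point, here is a slightly more permissive pattern. It is a sketch, not a complete address validator: it only adds + and ' to the local part and drops the four-letter cap on the top-level domain. The names LooseEmailMatcher and Extract are hypothetical.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class LooseEmailMatcher
{
    // Same shape as the original pattern, but the local part also accepts + and ',
    // and the top-level domain may be longer than four letters.
    private static readonly Regex Matcher =
        new Regex(@"\b[A-Za-z0-9._%+'-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b");

    public static string[] Extract(string text)
    {
        return Matcher.Matches(text).Cast<Match>()
            .Select(m => m.Value)
            .Distinct()
            .ToArray();
    }
}
```

It will still miss some valid addresses and accept some invalid ones, which is usually acceptable for scraping but not for validating user input.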
Second: A relatively direct clean up:
public static string Get(string url, bool proxy) {
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
if (proxy) {
req.Proxy = new WebProxy(proxyIP + ":" + proxyPort);
}
req.Method = "GET";
req.UserAgent = Settings1.Default.UserAgent;
if (Settings1.Default.EnableCookies == true) {
CookieContainer cont = new CookieContainer();
req.CookieContainer = cont;
}
using (WebResponse resp = req.GetResponse())
using (StreamReader SR = new StreamReader(resp.GetResponseStream())) {
return SR.ReadToEnd();
}
}
private static Regex emailMatcher = new Regex(@"(\b[a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b)", RegexOptions.Singleline);
private static string[] ExtractEmails(string htmlContent) {
return emailMatcher.Matches(htmlContent).OfType<Match>()
.Select(m => m.Groups[1].Value)
.Distinct()
.ToArray();
}
private void SEbgWorker_DoWork(object sender, DoWorkEventArgs e) {
Parallel.ForEach(allSElist.OfType<string>(), url => {
var htmlContent = Get(url, Settings1.Default.Proxyset);
var emails = ExtractEmails(htmlContent);
foreach (var email in emails) {
Action dgeins = () => mailDataGrid.Rows.Insert(0, email, url);
mailDataGrid.BeginInvoke(dgeins);
}
});
}
Here I have:
Made use of using statements to automate the cleanup of resources.
Eliminated all mutable shared state.
Regex is explicitly documented to have thread safe instance methods. So I only need a single instance.
Removed noise: no need to pass the URL to ExtractEmails because the extraction doesn't use the URL.
Get now only performs the HTML GET, and ExtractEmails just the extraction.
Third: The above will block threads on the slowest operation: the HTML GET.
The real concurrency benefit would be to replace HttpWebRequest.GetResponse and reading the response stream with their asynchronous equivalents.
Using Task would be the answer in .NET 4, but you need to directly work with Stream and encoding yourself because StreamReader doesn't provide any BeginABC/EndABC method pairs. But .NET 4.5 is almost here, so apply some async/await:
Nothing to do in ExtractEmails.
Get is now asynchronous, blocking in neither the HTTP GET nor reading the result.
SEbgWorker_DoWork uses Tasks directly to avoid mixing too many different ways to work with TPL. Since Get returns a Task<string>, we can simply continue with ContinueWith. Note that without TaskContinuationOptions the continuation runs whether or not the previous task succeeded; reading Result then rethrows any failure wrapped in an AggregateException:
This should work in .NET 4.5, but without a set of valid URLs for which this will work I cannot test.
public static async Task<string> Get(string url, bool proxy) {
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
if (proxy) {
req.Proxy = new WebProxy(proxyIP + ":" + proxyPort);
}
req.Method = "GET";
req.UserAgent = Settings1.Default.UserAgent;
if (Settings1.Default.EnableCookies == true) {
CookieContainer cont = new CookieContainer();
req.CookieContainer = cont;
}
using (WebResponse resp = await req.GetResponseAsync())
using (StreamReader SR = new StreamReader(resp.GetResponseStream())) {
return await SR.ReadToEndAsync();
}
}
private static Regex emailMatcher = new Regex(@"(\b[a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b)", RegexOptions.Singleline);
private static string[] ExtractEmails(string htmlContent) {
return emailMatcher.Matches(htmlContent).OfType<Match>()
.Select(m => m.Groups[1].Value)
.Distinct()
.ToArray();
}
private void SEbgWorker_DoWork(object sender, DoWorkEventArgs e) {
var tasks = allSElist.OfType<string>()
.Select(url => {
return Get(url, Settings1.Default.Proxyset)
.ContinueWith(htmlContentTask => {
// No TaskContinuationOptions, so this also runs after a failed GET;
// Result rethrows that failure wrapped in an AggregateException.
var htmlContent = htmlContentTask.Result;
var emails = ExtractEmails(htmlContent);
foreach (var email in emails) {
// No InvokeAsync on WinForms, so do this the old way.
Action dgeins = () => mailDataGrid.Rows.Insert(0, email, url);
mailDataGrid.BeginInvoke(dgeins);
}
});
})
.ToArray(); // materialize so every request starts and WaitAll gets an array
Task.WaitAll(tasks);
}
public static string htmlcon; //htmlsource
public static List<string> emailList = new List<string>();
The problem is that the members htmlcon and emailList are shared between threads and between iterations. Each iteration of Parallel.ForEach runs in parallel, which is why you see strange behaviour.
How to solve the problem:
Modify your code to work without static variables or shared state.
One option is to change from Parallel.ForEach to TPL Task chaining: the result of one parallel operation becomes the input of the next. It is one option among many for restructuring the code to avoid shared state.
Use locking or concurrent collections. Your htmlcon variable could be made volatile, but for the list you should use locks or concurrent collections.
The better way is to modify your code to avoid shared state; there are many ways to do that depending on your implementation, not only task chaining.
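If some shared state has to stay, the locking option could look roughly like this. It is a minimal sketch with a hypothetical EmailRegistry class: every access to the underlying HashSet goes through the same lock, so only one thread touches it at a time.

```csharp
using System.Collections.Generic;

public sealed class EmailRegistry
{
    private readonly object gate = new object();
    private readonly HashSet<string> emails = new HashSet<string>();

    // Returns true if the address was not seen before; check and insert happen under one lock.
    public bool Add(string email)
    {
        lock (gate)
        {
            return emails.Add(email);
        }
    }

    // Reads take the same lock as writes.
    public int Count
    {
        get
        {
            lock (gate)
            {
                return emails.Count;
            }
        }
    }
}
```

A call such as `if (registry.Add(email)) { /* queue the grid row */ }` would replace the unsynchronized Contains/Add pair, with the URL passed along as a local value rather than through a static field.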
I have this function that can get up to 10 items as an input list
public async Task<KeyValuePair<string, bool>[]> PayCallSendSMS(List<SmsRequest> ListSms)
{
List<Task<KeyValuePair<string, bool>>> tasks = new List<Task<KeyValuePair<string, bool>>>();
foreach (SmsRequest sms in ListSms)
{
tasks.Add(Task.Run(() => SendSMS(sms)));
}
var result = await Task.WhenAll(tasks);
return result;
}
and in this function, I await some JSON being downloaded, and once that is done I deserialize it.
public async Task<KeyValuePair<string, bool>> SendSMS(SmsRequest sms)
{
//some code
using (WebResponse response = webRequest.GetResponse())
{
using (Stream responseStream = response.GetResponseStream())
{
StreamReader rdr = new StreamReader(responseStream, Encoding.UTF8);
string Json = await rdr.ReadToEndAsync();
deserializedJsonDictionary = (Dictionary<string, object>)jsonSerializer.DeserializeObject(Json);
}
}
//some code
return GetResult(sms.recipient);
}
public KeyValuePair<string, bool> GetResult(string recipient)
{
if (deserializedJsonDictionary[STATUS].ToString().ToLower().Equals("true"))
{
return new KeyValuePair<string, bool>(recipient, true);
}
else // deserializedJsonDictionary[STATUS] == "False"
{
return new KeyValuePair<string, bool>(recipient, false);
}
}
My problem is in the return GetResult(); part, in which deserializedJsonDictionary is null (of course it is, because the JSON hasn't finished downloading),
but I don't know how to solve it.
I tried to use ContinueWith but it doesn't work for me.
I'm willing to accept any change to my original code and/or the design of the solution
Unrelated tip: Don't abuse KeyValuePair<>, use C# 7 value-tuples instead (not least because they're much easier to read).
Using a foreach loop to build a List<Task> is fine - though it can be more succinct to use .Select() instead. I use this approach in my answer.
But don't use Task.Run with the ancient WebRequest (HttpWebRequest) type. Instead use HttpClient, which has full support for async IO.
Also, you should conform to the .NET naming-convention:
All methods that are async should have Async as a method-name suffix (e.g. PayCallSendSMS should be named PayCallSendSmsAsync).
Acronyms and initialisms longer than 2 characters should be in PascalCase, not CAPS, so use Sms instead of SMS.
Use camelCase, not PascalCase, for parameters and locals - and List is a redundant prefix. A better name for ListSms would be smsRequests, as its type is List<SmsRequest>.
Generally speaking, parameters should be declared using the least-specific type required - especially collection parameters: consider typing them as IEnumerable<T> or IReadOnlyCollection<T> instead of T[], List<T>, and so on.
You need to first check that the response from the remote server actually is a JSON response (instead of a HTML error message or XML response) and has the expected status code - otherwise you'll be trying to deserialize something that is not JSON.
Consider supporting CancellationToken too (this is not included in my answer as it adds too much visual noise).
Always use Dictionary.TryGetValue instead of blindly assuming the dictionary indexer will match.
public async Task< IReadOnlyList<(String recipient, Boolean ok)> > PayCallSendSmsAsync( IEnumerable<SmsRequest> smsRequests )
{
using( HttpClient httpClient = this.httpClientFactory.Create() )
{
var tasks = smsRequests
.Select(r => SendSmsAsync(httpClient, r))
.ToList(); // <-- The call to ToList is important as it materializes the list and triggers all of the Tasks.
(String recipient, Boolean ok)[] results = await Task.WhenAll(tasks);
return results;
}
}
private static async Task<(String recipient, Boolean ok)> SendSmsAsync(HttpClient httpClient, SmsRequest smsRequest)
{
using (HttpRequestMessage request = new HttpRequestMessage( ... ) )
using (HttpResponseMessage response = await httpClient.SendAsync(request).ConfigureAwait(false))
{
String responseType = response.Content.Headers.ContentType?.MediaType ?? "";
if (responseType != "application/json" || response.StatusCode != HttpStatusCode.OK)
{
throw new InvalidOperationException("Expected HTTP 200 JSON response but encountered an HTTP " + response.StatusCode + " " + responseType + " response instead." );
}
String jsonText = await response.Content.ReadAsStringAsync().ConfigureAwait(false);
Dictionary<String,Object> dict = JsonConvert.DeserializeObject< Dictionary<String,Object> >(jsonText);
if(
dict != null &&
dict.TryGetValue(STATUS, out Object statusValue) &&
statusValue is String statusStr &&
"true".Equals( statusStr, StringComparison.OrdinalIgnoreCase )
)
{
return ( smsRequest.Recipient, ok: true );
}
else
{
return ( smsRequest.Recipient, ok: false );
}
}
}
For one of my projects I want to develop a library that can be used on different platforms (Desktop, Mobile, Surface, etc.). Hence I have opted for a Portable Class Library.
I am developing a class for making different API calls using HttpClient. I am stuck on how to call the method, handle the response, and work around errors. This is my code :-
public static async Task<JObject> ExecuteGet(string uri)
{
using (HttpClient client = new HttpClient())
{
// TODO - Send HTTP requests
HttpRequestMessage reqMsg = new HttpRequestMessage(HttpMethod.Get, uri);
reqMsg.Headers.Add(apiIdTag, apiIdKey);
reqMsg.Headers.Add(apiSecretTag, ApiSecret);
reqMsg.Headers.Add("Content-Type", "text/json");
reqMsg.Headers.Add("Accept", "application/json");
//response = await client.SendAsync(reqMsg);
//return response;
//if (response.IsSuccessStatusCode)
//{
string content = await response.Content.ReadAsStringAsync();
return (JObject.Parse(content));
//}
}
}
// Perform AGENT LOGIN Process
public static bool agentStatus() {
bool loginSuccess = false;
try
{
API_Utility.ExecuteGet("http://api.mintchat.com/agent/autoonline").Wait();
// ACCESS Response, JObject ???
}
catch
{
}
finally
{
}
Like ExecuteGet, I will also create an ExecutePost. My questions about ExecuteGet: (1) If I return the parsed JObject only when IsSuccessStatusCode is true, how can I learn about any other errors or messages to inform the user? (2) If I return the response, how do I assign it here:
response = API_Utility.ExecuteGet("http://api.mintchat.com/agent/autoonline").Wait();
that is giving error.
What would be the best approach to handle this situation ? And I got to call multiple API's, so different API will have different result sets.
Also, can you confirm that designing this way and adding PCL reference I will be able to access in multiple projects.
UPDATE :-
As mentioned in the two answers below, I have updated my code. As mentioned in the provided link, I am calling it from the other project. This is my code :-
Portable Class Library :-
private static HttpRequestMessage getGetRequest(string url)
{
HttpRequestMessage reqMsg = new HttpRequestMessage(HttpMethod.Get, url);
reqMsg.Headers.Add(apiIdTag, apiIdKey);
reqMsg.Headers.Add(apiSecretTag, ApiSecret);
reqMsg.Headers.Add("Content-Type", "text/json");
reqMsg.Headers.Add("Accept", "application/json");
return reqMsg;
}
// Perform AGENT LOGIN Process
public static async Task<bool> agentStatus() {
bool loginSuccess = false;
HttpClient client = null;
HttpRequestMessage request = null;
try
{
client = new HttpClient();
request = getGetRequest("http://api.mintchat.com/agent/autoonline");
response = await client.SendAsync(request).ConfigureAwait(false);
if (response.IsSuccessStatusCode)
{
string content = await response.Content.ReadAsStringAsync().ConfigureAwait(false);
JObject o = JObject.Parse(content);
bool stat = bool.Parse(o["status"].ToString());
///[MainAppDataObject sharedAppDataObject].authLogin.chatStatus = str;
o = null;
}
loginSuccess = true;
}
catch
{
}
finally
{
request = null;
client = null;
response = null;
}
return loginSuccess;
}
From the other WPF project, in a btn click event I am calling this as follows :-
private async void btnSignin_Click(object sender, RoutedEventArgs e)
{
/// Other code goes here
// ..........
agent = doLogin(emailid, encPswd);
if (agent != null)
{
//agent.OnlineStatus = getAgentStatus();
// Compile Error at this line
bool stat = await MintWinLib.Helpers.API_Utility.agentStatus();
...
I get these 4 errors :-
Error 1 Predefined type 'System.Runtime.CompilerServices.IAsyncStateMachine' is not defined or imported D:\...\MiveChat\CSC
Error 2 The type 'System.Threading.Tasks.Task`1<T0>' is defined in an assembly that is not referenced. You must add a reference to assembly 'System.Threading.Tasks, Version=1.5.11.0, Culture=neutral, PublicKeyToken=b03f5f7f89d50a3a'. D:\...\Login Form.xaml.cs 97 21
Error 3 Cannot find all types required by the 'async' modifier. Are you targeting the wrong framework version, or missing a reference to an assembly? D:\...\Login Form.xaml.cs 97 33
Error 4 Cannot find all types required by the 'async' modifier. Are you targeting the wrong framework version, or missing a reference to an assembly? D:\...\Login Form.xaml.cs 47 28
I tried adding System.Threading.Tasks from the PCL library only, that gave 7 different errors. Where am I going wrong ? What to do to make this working ?
Please guide me on this. I have spent lots of hours figuring out the best way to develop a library accessible to a desktop app & a Win Phone app.
Any help is highly appreciated. Thanks.
If you call an async api when making the http calls, you should also expose that async endpoint to the user, and not block the request using Task.Wait.
Also, when creating a third party library, it is recommended to use ConfigureAwait(false) to avoid deadlocks when the calling code tries to access the Result property or the Wait method. You should also follow guidelines and mark any async method with the Async suffix, so the method should be called AgentStatusAsync:
public static async Task<bool> AgentStatusAsync()
{
bool loginSuccess = false;
try
{
// awaiting the task will unwrap it and return the JObject
var jObject = await API_Utility.ExecuteGet("http://api.mintchat.com/agent/autoonline").ConfigureAwait(false);
// inspect jObject here and set loginSuccess accordingly
}
catch
{
}
return loginSuccess;
}
And inside ExecuteGet:
response = await client.SendAsync(reqMsg).ConfigureAwait(false);
string content = await response.Content.ReadAsStringAsync().ConfigureAwait(false);
In case IsSuccessStatusCode is false, you may throw an exception to the calling code to show something went wrong. To do that, you can use HttpResponseMessage.EnsureSuccessStatusCode, which throws an exception if the status code is outside the 2xx success range.
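To see that behaviour without making a network call, here is a small sketch that builds HttpResponseMessage instances by hand. The helper name StatusCheckSketch.Check is hypothetical; EnsureSuccessStatusCode and HttpRequestException are the real System.Net.Http members.

```csharp
using System.Net;
using System.Net.Http;

public static class StatusCheckSketch
{
    // Returns null when the status counts as success, otherwise the exception message.
    public static string Check(HttpStatusCode statusCode)
    {
        using (var response = new HttpResponseMessage(statusCode))
        {
            try
            {
                response.EnsureSuccessStatusCode();
                return null;
            }
            catch (HttpRequestException ex)
            {
                // The caller of ExecuteGet could catch this and report it to the user.
                return ex.Message;
            }
        }
    }
}
```

Here Check(HttpStatusCode.OK) and Check(HttpStatusCode.Created) return null, while Check(HttpStatusCode.NotFound) returns the exception's message.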
Personally, if ExecuteGet is a public API method, I would definitely not expose its result as a JObject but as a strongly typed type.
If you want the result of the task, you need to use the Result property:
var obj = API_Utility.ExecuteGet("http://api.mintchat.com/agent/autoonline").Result;
However, it's usually not a good idea to wait synchronously for an async method to complete, because it can cause deadlocks. The better approach is to await the method:
var obj = await API_Utility.ExecuteGet("http://api.mintchat.com/agent/autoonline");
Note that you need to make the calling method async as well:
public static async Task<bool> agentStatus()
Sync and async code don't play together very well, so async tends to propagate across the whole code base.
I am trying to be as thorough as I can in this post, as it is very important to me,
though the issue is very simple, and you can get the idea just by reading the title of this question.
The question is:
With healthy bandwidth (30mb VDSL) available,
how is it possible to issue multiple HttpWebRequests for a single piece of data / file,
so that each request downloads only a portion of the data,
and then, when all instances have completed, all parts are joined back into one piece?
Code:
What I have working so far is the same idea, only with each task = one HttpWebRequest = a different file,
so the speedup comes purely from task parallelism rather than from accelerating one download using multiple tasks/threads,
as in my question.
See the code below.
The next part is only a more detailed explanation and background on the subject, if you don't mind reading.
I am still working on a similar project that differs from the one in this question:
it (see code below) fetches as many different data sources as possible, one per task (different downloads/files),
so the speedup was gained because each task does not have to wait for the previous one to complete before it gets a chance to run.
What I am trying to do in this question (having almost everything ready in the code below) is to target the same URL for the same data,
so this time the speedup to gain is for a single task, the current download.
The idea is the same as in the code below, only this time SmartWebClient targets the same URL
using multiple instances.
Then (only theory for now) it will request partial content of the data,
with one request per instance.
The last issue is that I need to "put the puzzle back into one piece", another problem I need to find out about.
As you can see in this code, the only thing I have not yet worked on is the data parsing/processing, which I find very easy using HtmlAgilityPack, so that is no problem.
current code
main entry:
var htmlDictionary = urlsForExtraction.urlsConcrDict();
Parallel.ForEach(
urlList.Values,
new ParallelOptions { MaxDegreeOfParallelism = 20 },
url => Download(url, htmlDictionary)
);
foreach (var pair in htmlDictionary)
{
///Process(pair);
MessageBox.Show(pair.Value);
}
public class urlsForExtraction
{
const string URL_Dollar= "";
const string URL_UpdateUsersTimeOut="";
public ConcurrentDictionary<string, string> urlsConcrDict()
{
//need to find the syntax to extract fileds names so it would be possible to iterate on each instead of specying
ConcurrentDictionary<string, string> retDict = new ConcurrentDictionary<string, string>();
retDict.TryAdd("URL_Dollar", "Any.Url.com");
retDict.TryAdd("URL_UpdateUserstbl", "http://bing.com");
return retDict;
}
}
/// <summary>
/// second Stage Class consumes the Dictionary of urls for extraction
/// then downloads Each via parallel for each using The Smart WeBClient! (download(); )
/// </summary>
public class InitConcurentHtmDictExtrct
{
private void Download(string url, ConcurrentDictionary<string, string> htmlDictionary)
{
using (var webClient = new SmartWebClient())
{
webClient.Encoding = Encoding.GetEncoding("UTF-8");
webClient.Proxy = null;
htmlDictionary.TryAdd(url, webClient.DownloadString(url));
}
}
private ConcurrentDictionary<string, string> htmlDictionary;
public ConcurrentDictionary<string, string> LoopOnUrlsVia_SmartWC(Dictionary<string, string> urlList)
{
htmlDictionary = new ConcurrentDictionary<string, string>();
Parallel.ForEach(
urlList.Values,
new ParallelOptions { MaxDegreeOfParallelism = 20 },
url => Download(url, htmlDictionary)
);
return htmlDictionary;
}
}
/// <summary>
/// the Extraction Process, done via "HtmlAgility pack"
/// easy usage to collect information within a given html Documnet via referencing elements attributes
/// </summary>
public class Results
{
public struct ExtracionParameters
{
public string FileNameToSave;
public string directoryPath;
public string htmlElementType;
}
public enum Extraction
{
ById, ByClassName, ByElementName
}
public void ExtractHtmlDict( ConcurrentDictionary<string, string> htmlResults, Extraction By)
{
// helps with easy elements extraction from the page.
HtmlAttribute htAgPcAttrbs;
HtmlDocument HtmlAgPCDoc = new HtmlDocument();
/// will hold a name+content of each documnet-part that was aventually extracted
/// then from this container the build of the result page will be possible
Dictionary<string, HtmlDocument> dictResults = new Dictionary<string, HtmlDocument>();
foreach (KeyValuePair<string, string> htmlPair in htmlResults)
{
Process(htmlPair);
}
}
private static void Process(KeyValuePair<string, string> pair)
{
// do the html processing
}
}
public class SmartWebClient : WebClient
{
private readonly int maxConcurentConnectionCount;
public SmartWebClient(int maxConcurentConnectionCount = 20)
{
this.Proxy = null;
this.Encoding = Encoding.GetEncoding("UTF-8");
this.maxConcurentConnectionCount = maxConcurentConnectionCount;
}
protected override WebRequest GetWebRequest(Uri address)
{
var httpWebRequest = (HttpWebRequest)base.GetWebRequest(address);
if (httpWebRequest == null)
{
return null;
}
if (maxConcurentConnectionCount != 0)
{
httpWebRequest.ServicePoint.ConnectionLimit = maxConcurentConnectionCount;
}
return httpWebRequest;
}
}
}
This allows me to take advantage of good bandwidth,
only I am far from the solution in question. I will really appreciate any clue on where to start.
If the server supports what Wikipedia calls byte serving, you can multiplex a file download by spawning multiple requests, each with a specific Range header value (using the AddRange method; see also How to download the data from the server discontinuously?). Most serious HTTP servers do support byte ranges.
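To make the range arithmetic concrete, here is a small sketch that only computes the inclusive (start, end) byte offsets for each chunk, with no network access. The names ByteRanges and Split are hypothetical; each resulting pair is what one request would pass to HttpWebRequest.AddRange(long, long).

```csharp
using System;
using System.Collections.Generic;

public static class ByteRanges
{
    // Splits [0, size) into consecutive inclusive ranges of at most chunkSize bytes.
    public static List<Tuple<long, long>> Split(long size, int chunkSize)
    {
        if (size < 0) throw new ArgumentOutOfRangeException("size");
        if (chunkSize <= 0) throw new ArgumentOutOfRangeException("chunkSize");

        var ranges = new List<Tuple<long, long>>();
        for (long start = 0; start < size; start += chunkSize)
        {
            // The last chunk may be shorter; Range offsets are inclusive, hence the -1.
            long end = Math.Min(start + chunkSize, size) - 1;
            ranges.Add(Tuple.Create(start, end));
        }
        return ranges;
    }
}
```

For a 25-byte resource and a chunk size of 10 this yields (0, 9), (10, 19) and (20, 24); a size that is an exact multiple of the chunk size produces no empty trailing range.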
Here is some sample code that implements a parallel download of a file using byte range:
public static void ParallelDownloadFile(string uri, string filePath, int chunkSize)
{
if (uri == null)
throw new ArgumentNullException("uri");
// determine file size first
long size = GetFileSize(uri);
using (FileStream file = new FileStream(filePath, FileMode.Create, FileAccess.Write, FileShare.Write))
{
file.SetLength(size); // set the length first
object syncObject = new object(); // synchronize file writes
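// Round up so an exact multiple of chunkSize does not add an empty, out-of-range chunk
Parallel.ForEach(LongRange(0, (size + chunkSize - 1) / chunkSize), (start) =>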
{
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
request.AddRange(start * chunkSize, start * chunkSize + chunkSize - 1);
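// Dispose the response so its connection goes back to the pool
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
using (Stream stream = response.GetResponseStream())
{
// All chunks share one FileStream, so seek + write must be serialized
lock (syncObject)
{
file.Seek(start * chunkSize, SeekOrigin.Begin);
stream.CopyTo(file);
}
}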
});
}
}
public static long GetFileSize(string uri)
{
if (uri == null)
throw new ArgumentNullException("uri");
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
request.Method = "HEAD";
// Dispose the HEAD response so it does not hold a connection open
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
return response.ContentLength;
}
}
private static IEnumerable<long> LongRange(long start, long count)
{
long i = 0;
while (true)
{
if (i >= count)
{
yield break;
}
yield return start + i;
i++;
}
}
And sample usage:
private static void TestParallelDownload()
{
string uri = "http://localhost/welcome.png";
string fileName = Path.GetFileName(uri);
ParallelDownloadFile(uri, fileName, 10000);
}
PS: I'd be curious to know if it's really more interesting to do this parallel thing rather than to just use WebClient.DownloadFile... Maybe in slow network scenarios?
I really hope there's someone experienced enough both with TPL & System.Net Classes and methods
What started as a simple thought of using TPL on a current sequential set of actions brought my project to a halt.
As I am still fresh with .NET, I jumped straight into deep water using TPL ...
I was trying to extract an Aspx page's source/content(html) using WebClient
Having multiple requests per day (around 20-30 pages to go through) and extract specific values out of the source code... being only one of few daily tasks the server has on its list,
Led me to try implement it by using TPL, thus gain some speed.
Although I tried using Task.Factory.StartNew() to iterate over a few WC instances,
on the first execution the application just does not get any result from the WebClient.
This is my last try on it
static void Main(string[] args)
{
EnumForEach<Act>(Execute);
Task.WaitAll();
}
public static void EnumForEach<Mode>(Action<Mode> Exec)
{
foreach (Mode mode in Enum.GetValues(typeof(Mode)))
{
Mode Curr = mode;
Task.Factory.StartNew(() => Exec(Curr) );
}
}
string ResultsDirectory = Environment.CurrentDirectory,
URL = "",
TempSourceDocExcracted ="",
ResultFile="";
enum Act
{
dolar, ValidateTimeOut
}
void Execute(Act Exc)
{
switch (Exc)
{
case Act.dolar:
URL = "http://www.AnyDomainHere.Com";
ResultFile =ResultsDirectory + "\\TempHtm.htm";
TempSourceDocExcracted = IeNgn.AgilityPacDocExtraction(URL).GetElementbyId("Dv_Main").InnerHtml;
File.WriteAllText(ResultFile, TempSourceDocExcracted);
break;
case Act.ValidateTimeOut:
URL = "http://www.AnotherDomainHere.Com";
ResultFile += "\\TempHtm.htm";
TempSourceDocExcracted = IeNgn.AgilityPacDocExtraction(URL).GetElementbyId("Dv_Main").InnerHtml;
File.WriteAllText(ResultFile, TempSourceDocExcracted);
break;
}
//usage of HtmlAgilityPack to extract Values of elements by their attributes/properties
public HtmlAgilityPack.HtmlDocument AgilityPacDocExtraction(string URL)
{
using (WC = new WebClient())
{
WC.Proxy = null;
WC.Encoding = Encoding.GetEncoding("UTF-8");
tmpExtractedPageValue = WC.DownloadString(URL);
retAglPacHtmDoc.LoadHtml(tmpExtractedPageValue);
return retAglPacHtmDoc;
}
}
What am I doing wrong? Is it possible to use a WebClient using TPL at all or should I use another tool (not being able to use IIS 7 / .net4.5)?
I see at least several issues:
Naming: FlNm is not a name. Visual Studio is a modern IDE with smart code completion, so there's no need to save keystrokes (you may start here; there are alternatives too, the main thing is to keep it consistent: C# Coding Conventions).
If you're using multithreading, you need to care about resource sharing. For example, FlNm is a static string that is assigned inside each thread, so its value is not deterministic. (Even if it ran sequentially, the code would be faulty: you would append the file name to the path in each iteration, so it would look like c:\TempHtm.htm\TempHtm.htm\TempHtm.htm.)
You're writing to the same file from different threads (well, at least that was your intent, I think), which is usually a recipe for disaster in multithreading. The question is whether you need to write anything to disk at all, or whether the page can be downloaded as a string and parsed without touching the disk - there's a good example of what it means to touch a disk.
Overall I think you should parallelize only the downloading, and not involve HtmlAgilityPack in multithreading, as I think you don't know whether it is thread safe. Also, downloading has a good performance/thread count ratio; HTML parsing not so much, except perhaps when the thread count equals the core count, but not more. Moreover, I would separate downloading and parsing, as that would be easier to test, understand and maintain.
Update: I don't understand your full intent, but this may help you get started (it's not production code; you should add retry/error catching, etc.).
Also, at the end is an extended WebClient class that lets you get more threads spinning, because by default WebClient allows only two connections per host.
class Program
{
static void Main(string[] args)
{
var urlList = new List<string>
{
"http://google.com",
"http://yahoo.com",
"http://bing.com",
"http://ask.com"
};
var htmlDictionary = new ConcurrentDictionary<string, string>();
Parallel.ForEach(urlList, new ParallelOptions { MaxDegreeOfParallelism = 20 }, url => Download(url, htmlDictionary));
foreach (var pair in htmlDictionary)
{
Process(pair);
}
}
private static void Process(KeyValuePair<string, string> pair)
{
// do the html processing
}
private static void Download(string url, ConcurrentDictionary<string, string> htmlDictionary)
{
using (var webClient = new SmartWebClient())
{
htmlDictionary.TryAdd(url, webClient.DownloadString(url));
}
}
}
public class SmartWebClient : WebClient
{
private readonly int maxConcurentConnectionCount;
public SmartWebClient(int maxConcurentConnectionCount = 20)
{
this.maxConcurentConnectionCount = maxConcurentConnectionCount;
}
protected override WebRequest GetWebRequest(Uri address)
{
var httpWebRequest = (HttpWebRequest)base.GetWebRequest(address);
if (httpWebRequest == null)
{
return null;
}
if (maxConcurentConnectionCount != 0)
{
httpWebRequest.ServicePoint.ConnectionLimit = maxConcurentConnectionCount;
}
return httpWebRequest;
}
}
I have a function in my program that creates new widgets to represent data; however, whenever a widget is created I get a lot of "AutoRelease with no NSAutoReleasePool in place" error messages. Since an NSAutoReleasePool should be created automatically on the main thread, I have an inkling that these error messages appear because an async function might be creating threads of its own...
This is the function called to create widgets to represent the latest information. This function is called pretty often:
private void CreateAndDisplayTvShowWidget (TvShow show)
{
var Widget = new TvShowWidgetController (show);
Widget.OnRemoveWidget += ConfirmRemoveTvShow;
Widget.View.SetFrameOrigin (new PointF (0, -150));
Widget.View.SetFrameSize (new SizeF (ContentView.Frame.Width, 150));
ContentView.AddSubview (Widget.View);
show.ShowWidget = Widget;
}
This function is usually called when this async function returns:
private static void WebRequestCallback (IAsyncResult result)
{
HttpWebRequest request = (HttpWebRequest)result.AsyncState;
HttpWebResponse response = (HttpWebResponse)request.EndGetResponse (result);
StreamReader responseStream = new StreamReader (response.GetResponseStream ());
string responseString = responseStream.ReadToEnd ();
responseStream.Close ();
ProcessResponse (responseString, request);
}
ProcessResponse (responseString, request) looks like this:
private static void ProcessResponse (string responseString, HttpWebRequest request)
{
string requestUrl = request.Address.ToString ();
if (requestUrl.Contains (ShowSearchTag)) {
List<TvShow> searchResults = TvDbParser.ParseTvShowSearchResults (responseString);
TvShowSearchTimeoutClock.Enabled = false;
OnTvShowSearchComplete (searchResults);
} else if (requestUrl.Contains (MirrorListTag)) {
MirrorList = TvDbParser.ParseMirrorList (responseString);
SendRequestsOnHold ();
} else if (requestUrl.Contains (TvShowBaseTag)) {
TvShowBase showBase = TvDbParser.ParseTvShowBase (responseString);
OnTvShowBaseRecieved (showBase);
} else if (requestUrl.Contains (ImagePathReqTag)) {
string showID = GetShowIDFromImagePathRequest (requestUrl);
TvShowImagePath imagePath = TvDbParser.ParseTvShowImagePath (showID, responseString);
OnTvShowImagePathRecieved (imagePath);
}
}
CreateAndDisplayTvShowWidget (TvShow show) is called when the event OnTvShowBaseRecieved (TvShow) is raised, which is when I get tons of error messages regarding NSAutoReleasePool...
The last two functions are part of what is supposed to be a cross-platform assembly, so I can't have any MonoMac-specific code in there...
I never call any auto-release or release code for my widgets, so I assume that the MonoMac bindings do this automatically as part of their garbage collection?
You can create autorelease pools at any point within the call stack; you can even have multiple nested autorelease pools within the same call stack. So you should be able to create your autorelease pools in the async entry functions.
You only need an NSAutoreleasePool if you use the auto-release features of objects. A solution is to create an NSAutoreleasePool around the code that manipulates auto-released objects (in the async callback).
Edit:
Have you tried wrapping the creation code in an NSAutoreleasePool? As this is the only place where you call MonoMac code, this should solve the issue.
private void CreateAndDisplayTvShowWidget (TvShow show)
{
using(NSAutoreleasePool pool = new NSAutoreleasePool())
{
var Widget = new TvShowWidgetController (show);
Widget.OnRemoveWidget += ConfirmRemoveTvShow;
Widget.View.SetFrameOrigin (new PointF (0, -150));
Widget.View.SetFrameSize (new SizeF (ContentView.Frame.Width, 150));
ContentView.AddSubview (Widget.View);
show.ShowWidget = Widget;
}
}
Note that even if you don't use auto-released objects directly, there are some cases where the Cocoa API uses them under the hood.
I had a similar problem and it was the response.GetResponseStream that was the problem. I surrounded this code with...
using (NSAutoreleasePool pool = new NSAutoreleasePool()) {
}
... and that solved my problem.