I really hope there's someone experienced with both TPL and the System.Net classes and methods.
What started as a simple idea of using TPL on a currently sequential set of actions has brought my project to a halt.
As I am still fresh with .NET, jumping straight into the deep water of TPL ...
I was trying to extract an ASPX page's source/content (HTML) using WebClient.
Having multiple requests per day (around 20-30 pages to go through) to extract specific values out of the source code, this being only one of a few daily tasks the server has on its list,
led me to try implementing it using TPL and thus gain some speed.
Although I tried using Task.Factory.StartNew() to iterate over a few WebClient instances,
on the first execution the application simply does not get any result from the WebClient.
This is my latest attempt:
static void Main(string[] args)
{
EnumForEach<Act>(Execute);
Task.WaitAll();
}
public static void EnumForEach<Mode>(Action<Mode> Exec)
{
foreach (Mode mode in Enum.GetValues(typeof(Mode)))
{
Mode Curr = mode;
Task.Factory.StartNew(() => Exec(Curr) );
}
}
string ResultsDirectory = Environment.CurrentDirectory,
URL = "",
TempSourceDocExcracted ="",
ResultFile="";
enum Act
{
dolar, ValidateTimeOut
}
void Execute(Act Exc)
{
switch (Exc)
{
case Act.dolar:
URL = "http://www.AnyDomainHere.Com";
ResultFile =ResultsDirectory + "\\TempHtm.htm";
TempSourceDocExcracted = IeNgn.AgilityPacDocExtraction(URL).GetElementbyId("Dv_Main").InnerHtml;
File.WriteAllText(ResultFile, TempSourceDocExcracted);
break;
case Act.ValidateTimeOut:
URL = "http://www.AnotherDomainHere.Com";
ResultFile += "\\TempHtm.htm";
TempSourceDocExcracted = IeNgn.AgilityPacDocExtraction(URL).GetElementbyId("Dv_Main").InnerHtml;
File.WriteAllText(ResultFile, TempSourceDocExcracted);
break;
}
}
//usage of HtmlAgilityPack to extract values of elements by their attributes/properties
public HtmlAgilityPack.HtmlDocument AgilityPacDocExtraction(string URL)
{
using (WC = new WebClient())
{
WC.Proxy = null;
WC.Encoding = Encoding.GetEncoding("UTF-8");
tmpExtractedPageValue = WC.DownloadString(URL);
retAglPacHtmDoc.LoadHtml(tmpExtractedPageValue);
return retAglPacHtmDoc;
}
}
What am I doing wrong? Is it possible to use a WebClient with TPL at all, or should I use another tool (I am not able to use IIS 7 / .NET 4.5)?
I see several issues:
Naming - FlNm is not a name. Visual Studio is a modern IDE with smart code completion; there's no need to save keystrokes (you may start here, there are alternatives too; the main thing is to keep it consistent: C# Coding Conventions).
If you're using multithreading, you need to care about resource sharing. For example FlNm is a static string and it is assigned inside each thread, so its value is not deterministic (and even if it were running sequentially, the code would be faulty - you would be appending the file name to the path on each iteration, so it would end up like c:\TempHtm.htm\TempHtm.htm\TempHtm.htm).
You're writing to the same file from different threads (well, at least that was your intent, I think) - usually that's a recipe for disaster in multithreading. The question is whether you need to write anything to disk at all, or whether it can be downloaded as a string and parsed without touching the disk - there's a good example of what it means to touch a disk.
Overall I think you should parallelize only the downloading, so do not involve HtmlAgilityPack in the multithreading, as I don't think you know whether it is thread safe. On the other hand, downloading has a good performance/thread-count ratio; HTML parsing - not so much, maybe if the thread count equals the core count, but not more. Even more - I would separate downloading and parsing, as that would be easier to test, understand and maintain.
Update: I don't fully understand your intent, but this may help you get started (it's not production code, you should add retries/error handling, etc.).
Also at the end is an extended WebClient class that allows you to get more threads spinning, because by default WebClient allows only two connections.
class Program
{
static void Main(string[] args)
{
var urlList = new List<string>
{
"http://google.com",
"http://yahoo.com",
"http://bing.com",
"http://ask.com"
};
var htmlDictionary = new ConcurrentDictionary<string, string>();
Parallel.ForEach(urlList, new ParallelOptions { MaxDegreeOfParallelism = 20 }, url => Download(url, htmlDictionary));
foreach (var pair in htmlDictionary)
{
Process(pair);
}
}
private static void Process(KeyValuePair<string, string> pair)
{
// do the html processing
}
private static void Download(string url, ConcurrentDictionary<string, string> htmlDictionary)
{
using (var webClient = new SmartWebClient())
{
htmlDictionary.TryAdd(url, webClient.DownloadString(url));
}
}
}
public class SmartWebClient : WebClient
{
private readonly int maxConcurentConnectionCount;
public SmartWebClient(int maxConcurentConnectionCount = 20)
{
this.maxConcurentConnectionCount = maxConcurentConnectionCount;
}
protected override WebRequest GetWebRequest(Uri address)
{
var httpWebRequest = (HttpWebRequest)base.GetWebRequest(address);
if (httpWebRequest == null)
{
return null;
}
if (maxConcurentConnectionCount != 0)
{
httpWebRequest.ServicePoint.ConnectionLimit = maxConcurentConnectionCount;
}
return httpWebRequest;
}
}
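If subclassing WebClient feels like overkill, another option (a general System.Net setting, not something the code above relies on) is to raise the process-wide default connection limit; it only affects service points created after it is set, so set it early:
// Allow up to 20 concurrent connections per host for requests made after this point.
System.Net.ServicePointManager.DefaultConnectionLimit = 20;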
Rookie here so please be nice!! I have had so much fun learning to program and gotten some great help along the way when google failed me. But alas, I'm stuck again.
I have a C# program that looks like this (it's MWS, if anyone is familiar with it).
I've tried so many different ways to get this to effectively loop through a list of values in a text file. The problem I'm having is that the Main function is where I have to set the loop, but the BuildClass is where I need to cycle through the values in the text file (sentinel). I've included some stuff that probably isn't necessary just in case it is messing my code up and I don't realize it.
Here's what I've tried:
Setting the loop inside the BuildClass - didn't expect it to work, and it threw an exception before getting to the sentinel.
Referencing the sentinel within the main function by changing the "using" or "var" on the main function's sentinel to public - turned EVERYTHING red in Visual Studio.
Moving the string sentinel outside the main function so that the function and the BuildClass would recognize it - the main function did not recognize it anymore.
I've tried so many other things unsuccessfully. I've gotten it to loop with the same sentinel value passed from BuildClass to the function over and over again but that's about it.
What I think I need:
A destructive version of StreamReader that will remove the value from the text file when reading it (see the sketch after this list). I'll put this inside the BuildClass, so that on the next loop of the main function the next value will be read and passed in, until the file is empty, terminating the loop.
an understanding of why changing sentinel to public destroys the code so badly. I have a decent understanding of why the other attempts wouldn't work.
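For what it's worth, a rough sketch of such a destructive read (a hypothetical helper, assuming the file is small enough to rewrite on every call; requires System.IO and System.Linq):
// Hypothetical helper: returns the first line of the file and rewrites the
// file without it, so repeated calls consume the file line by line.
private static string ReadAndRemoveFirstLine(string path)
{
    string[] lines = File.ReadAllLines(path);
    if (lines.Length == 0)
        return null;                          // file is empty, the loop can stop
    File.WriteAllLines(path, lines.Skip(1).ToArray());
    return lines[0];
}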
namespace MainSpace
{
public class MainClass
{
int i;
public static void Main(string[] args)
{
ClientClass client = new ClientInterface(appName, appVersion, password, config);
MainClass sample = new MainClass(client);
string sentinel;
using (var streamReader = new StreamReader(@"sample.txt", true))
while((sentinel = streamReader.ReadLine()) != null)
{
try
{
//stuff
response = sample.InvokeBuild();
Console.WriteLine("Response Stuff");
string responseXml = response.ToXML();
Console.WriteLine(responseXml);
StreamWriter FileWrite = new StreamWriter("FileTest.xml", true);
FileWrite.WriteLine(responseXml);
FileWrite.Close();
}
catch (ExceptionsClass)
{
// Exception stuff
throw;
}
}
}
private readonly ClientInterface client;
public MainClass(ClientInterface client)
{
this.client = client;
}
public BuildClass InvokeBuild()
{
{
using (var streamReader = new StreamReader("sample.txt", true))
{
string sentinel = streamReader.ReadLine();
Thread.Sleep(6000);
i++;
Console.WriteLine("attempt " + i);
// Create a request.
RequestClass request = new RequestClass();
//Password Stuff
request.IdType = idType;
IdListType idList = new IdListType();
idList.Id.Add(sentinel);
request.IdList = idList;
return this.client.RequestClass(request);
}
}
}
}
I have an application where I need to create files with a unique and sequential number as part of the file name. My first thought was to use (since this application does not have any other data storage) a text file that would contain a number, and I would increment this number so that my application would always create a file with a unique id.
Then I thought that when more than one user submits to this application at the same time, one process might be reading the txt file before it has been written by the previous process. So I am looking for a way to read and write to a file (with try/catch, so that I can know when it's being used by another process and then wait and try to read from it a few more times) in the same 'process' without unlocking the file in between.
If what I am saying above sounds like a bad option, could you please give me an alternative? How would you keep track of unique identification numbers for an application like mine?
Thanks.
If it's a single application then you can store the current number in your application settings. Load that number at startup. Then with each request you can safely increment it and use the result. Save the sequential number when the program shuts down. For example:
private int _fileNumber;
// at application startup
_fileNumber = LoadFileNumberFromSettings();
// to increment
public int GetNextFile()
{
return Interlocked.Increment(ref _fileNumber);
}
// at application shutdown
SaveFileNumberToSettings(_fileNumber);
Or, you might want to make sure that the file number is saved whenever it's incremented. If so, change your GetNextFile method:
private readonly object _fileLock = new object();
public int GetNextFile()
{
lock (_fileLock)
{
int result = ++_fileNumber;
SaveFileNumberToSettings(_fileNumber);
return result;
}
}
Note also that it might be reasonable to use the registry for this, rather than a file.
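For instance, a rough sketch of the registry variant (hypothetical key and value names; like the settings approach, this still does not coordinate between separate processes):
using Microsoft.Win32;

// Hypothetical key/value names: read the current number, increment it and store it back.
private static int GetNextFileNumberFromRegistry()
{
    using (RegistryKey key = Registry.CurrentUser.CreateSubKey(@"Software\MyApp"))
    {
        int current = (int)key.GetValue("FileNumber", 0);
        key.SetValue("FileNumber", current + 1, RegistryValueKind.DWord);
        return current + 1;
    }
}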
Edit: As Alireza pointed out in the comments, this is not a valid way to lock between multiple applications.
You can always lock the access to the file (so you won't need to rely on exceptions).
e.g:
// Create a lock in your class
private static object LockObject = new object();
// and then lock on this object when you access the file like this:
lock(LockObject)
{
... access to the file
}
Edit2: It seems that you can use Mutex to perform inter-application signalling.
private static System.Threading.Mutex m = new System.Threading.Mutex(false, "LockMutex");
void AccessMethod()
{
try
{
m.WaitOne();
// Access the file
}
finally
{
m.ReleaseMutex();
}
}
But it's not the best pattern for generating unique IDs. Maybe a sequence in a database would be better? If you don't have a database, you can use GUIDs or a local database (even Access would be better, I think).
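For example, if the identifiers only need to be unique rather than strictly sequential, a GUID-based name needs no locking at all (the extension here is arbitrary):
// Globally unique without any cross-process coordination, at the cost of ordering/readability.
string fileName = Guid.NewGuid().ToString("N") + ".dat";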
I would prefer a complex and universal solution with a global mutex. It uses a mutex whose name is prefixed with "Global\", which makes it system-wide, i.e. one mutex instance is shared across all processes. If your program runs in a friendly environment, or you can specify strict permissions limited to a user account you can trust, then it works well.
Keep in mind that this solution is not transactional and is not protected against thread-abortion/process-termination.
Not transactional means that if your process/thread is caught in the middle of a storage file modification and is terminated/aborted, then the storage file will be left in an unknown state. For instance, it can be left empty. You can protect yourself against loss of data (loss of the last used index) by writing the new value first, saving the file and only then removing the previous value. The reading procedure should expect a file with multiple numbers and should take the greatest.
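A rough sketch of that idea (hypothetical helpers, not part of the code below; requires System.IO):
private static long ReadGreatestIndex(string path)
{
    if (!File.Exists(path))
        return 0;
    long max = 0;
    // The file may contain several numbers if a previous writer was interrupted;
    // take the greatest one.
    foreach (string line in File.ReadAllLines(path))
    {
        long value;
        if (long.TryParse(line, out value) && value > max)
            max = value;
    }
    return max;
}

private static void StoreIndex(string path, long newIndex)
{
    // Append the new value first, so an interruption never loses the latest index...
    File.AppendAllText(path, newIndex + Environment.NewLine);
    // ...then rewrite the file with only the new value to keep it small.
    File.WriteAllText(path, newIndex + Environment.NewLine);
}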
Not protected against thread abortion means that if a thread which obtained the mutex is aborted unexpectedly, and/or you do not have proper exception handling, then the mutex could stay locked for the life of the process that created that thread. In order to make the solution abort-protected you will have to implement timeouts on obtaining the lock, i.e. replace the following line, which waits forever,
blnResult = iLock.Mutex.WaitOne();
with something with timeout.
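For example (the 30-second timeout is an arbitrary value to illustrate the overload):
// Wait at most 30 seconds for the mutex instead of blocking forever.
blnResult = iLock.Mutex.WaitOne(TimeSpan.FromSeconds(30));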
Summing this up, I am trying to say that if you are looking for a really robust solution, you will end up utilizing some kind of transactional database or writing such a database yourself :)
Here is the working code without timeout handling (I do not need it in my solution). It is robust enough to begin with.
using System;
using System.IO;
using System.Security.AccessControl;
using System.Security.Principal;
using System.Threading;
namespace ConsoleApplication31
{
class Program
{
//You only need one instance of that Mutex for each application domain (commonly each process).
private static SMutex mclsIOLock;
static void Main(string[] args)
{
//Initialize the mutex. Here you need to know the path to the file you use to store application data.
string strEnumStorageFilePath = Path.Combine(
Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData),
"MyAppEnumStorage.txt");
mclsIOLock = IOMutexGet(strEnumStorageFilePath);
}
//Template for the main processing routine.
public static void RequestProcess()
{
//This flag is used to protect against unwanted lock releases in case of recursive routines.
bool blnLockIsSet = false;
try
{
//Obtain the lock.
blnLockIsSet = IOLockSet(mclsIOLock);
//Read file data, update file data. Do not put much of long-running code here.
//Other processes may be waiting for the lock release.
}
finally
{
//Release the lock if it was obtained in this particular call stack frame.
IOLockRelease(mclsIOLock, blnLockIsSet);
}
//Put your long-running code here.
}
private static SMutex IOMutexGet(string iMutexNameBase)
{
SMutex clsResult = null;
clsResult = new SMutex();
string strSystemObjectName = @"Global\" + iMutexNameBase.Replace('\\', '_');
//Give permissions to all authenticated users.
SecurityIdentifier clsAuthenticatedUsers = new SecurityIdentifier(WellKnownSidType.AuthenticatedUserSid, null);
MutexSecurity clsMutexSecurity = new MutexSecurity();
MutexAccessRule clsMutexAccessRule = new MutexAccessRule(
clsAuthenticatedUsers,
MutexRights.FullControl,
AccessControlType.Allow);
clsMutexSecurity.AddAccessRule(clsMutexAccessRule);
//Create the mutex or open an existing one.
bool blnCreatedNew;
clsResult.Mutex = new Mutex(
false,
strSystemObjectName,
out blnCreatedNew,
clsMutexSecurity);
clsResult.IsMutexHeldByCurrentAppDomain = false;
return clsResult;
}
//Release IO lock.
private static void IOLockRelease(
SMutex iLock,
bool? iLockIsSetInCurrentStackFrame = null)
{
if (iLock != null)
{
lock (iLock)
{
if (iLock.IsMutexHeldByCurrentAppDomain &&
(!iLockIsSetInCurrentStackFrame.HasValue ||
iLockIsSetInCurrentStackFrame.Value))
{
iLock.MutexOwnerThread = null;
iLock.IsMutexHeldByCurrentAppDomain = false;
iLock.Mutex.ReleaseMutex();
}
}
}
}
//Set the IO lock.
private static bool IOLockSet(SMutex iLock)
{
bool blnResult = false;
try
{
if (iLock != null)
{
if (iLock.MutexOwnerThread != Thread.CurrentThread)
{
blnResult = iLock.Mutex.WaitOne();
iLock.IsMutexHeldByCurrentAppDomain = blnResult;
if (blnResult)
{
iLock.MutexOwnerThread = Thread.CurrentThread;
}
else
{
throw new ApplicationException("Failed to obtain the IO lock.");
}
}
}
}
catch (AbandonedMutexException iMutexAbandonedException)
{
blnResult = true;
iLock.IsMutexHeldByCurrentAppDomain = true;
iLock.MutexOwnerThread = Thread.CurrentThread;
}
return blnResult;
}
}
internal class SMutex
{
public Mutex Mutex;
public bool IsMutexHeldByCurrentAppDomain;
public Thread MutexOwnerThread;
}
}
I am trying to be as thorough as I can in this post, as it is very important to me,
though the issue is very simple, and only by reading the title of this question you can get the idea...
The question is:
with healthy bandwidth (30 Mb VDSL) available...
how is it possible to issue multiple HttpWebRequests for a single piece of data / a single file,
so that each request downloads only a portion of the data,
and then, when all instances have completed, all parts are joined back into one piece?
Code:
...what I have got working so far is the same idea, only each task = HttpWebRequest = a different file,
so the speedup is pure task parallelism rather than acceleration of one download using multiple tasks/threads
as in my question.
See the code below.
The next part is just a more detailed explanation and background on the subject... if you don't mind reading.
I am still on a similar project that differs from this one (the one in question)
in that it (see the code below) was trying to fetch as many different data sources as possible, one for each separate task (different downloads/files),
... so the speedup was gained because each task does not have to wait for the former one to complete before it gets a chance to be executed.
What I am trying to do in this question (having almost everything ready in the code below) is actually to target the same URL for the same data,
so this time the speedup to gain is for the single task - the current download.
Implementing the same idea as in the code below, only this time letting SmartWebClient target the same URL by
using multiple instances.
Then (only theory for now) it will request partial content of the data,
with multiple requests, one from each of the instances.
The last issue is that I need to "put the puzzle back into one piece"... another problem I need to figure out...
As you can see in this code, the only thing I did not get to work on yet is the data parsing/processing, which I find to be very easy using HtmlAgilityPack, so no problem there.
Current code
Main entry:
var htmlDictionary = urlsForExtraction.urlsConcrDict();
Parallel.ForEach(
urlList.Values,
new ParallelOptions { MaxDegreeOfParallelism = 20 },
url => Download(url, htmlDictionary)
);
foreach (var pair in htmlDictionary)
{
///Process(pair);
MessageBox.Show(pair.Value);
}
public class urlsForExtraction
{
const string URL_Dollar= "";
const string URL_UpdateUsersTimeOut="";
public ConcurrentDictionary<string, string> urlsConcrDict()
{
//need to find the syntax to extract field names so it would be possible to iterate on each instead of specifying them
ConcurrentDictionary<string, string> retDict = new ConcurrentDictionary<string, string>();
retDict.TryAdd("URL_Dollar", "Any.Url.com");
retDict.TryAdd("URL_UpdateUserstbl", "http://bing.com");
return retDict;
}
}
/// <summary>
/// second Stage Class consumes the Dictionary of urls for extraction
/// then downloads each via Parallel.ForEach using the SmartWebClient (Download(); )
/// </summary>
public class InitConcurentHtmDictExtrct
{
private void Download(string url, ConcurrentDictionary<string, string> htmlDictionary)
{
using (var webClient = new SmartWebClient())
{
webClient.Encoding = Encoding.GetEncoding("UTF-8");
webClient.Proxy = null;
htmlDictionary.TryAdd(url, webClient.DownloadString(url));
}
}
private ConcurrentDictionary<string, string> htmlDictionary;
public ConcurrentDictionary<string, string> LoopOnUrlsVia_SmartWC(Dictionary<string, string> urlList)
{
htmlDictionary = new ConcurrentDictionary<string, string>();
Parallel.ForEach(
urlList.Values,
new ParallelOptions { MaxDegreeOfParallelism = 20 },
url => Download(url, htmlDictionary)
);
return htmlDictionary;
}
}
/// <summary>
/// the Extraction Process, done via "HtmlAgility pack"
/// easy usage to collect information within a given HTML document via referencing element attributes
/// </summary>
public class Results
{
public struct ExtracionParameters
{
public string FileNameToSave;
public string directoryPath;
public string htmlElementType;
}
public enum Extraction
{
ById, ByClassName, ByElementName
}
public void ExtractHtmlDict(ConcurrentDictionary<string, string> htmlResults, Extraction By)
{
// helps with easy elements extraction from the page.
HtmlAttribute htAgPcAttrbs;
HtmlDocument HtmlAgPCDoc = new HtmlDocument();
/// will hold a name+content of each document part that was eventually extracted
/// then from this container the build of the result page will be possible
Dictionary<string, HtmlDocument> dictResults = new Dictionary<string, HtmlDocument>();
foreach (KeyValuePair<string, string> htmlPair in htmlResults)
{
Process(htmlPair);
}
}
private static void Process(KeyValuePair<string, string> pair)
{
// do the html processing
}
}
public class SmartWebClient : WebClient
{
private readonly int maxConcurentConnectionCount;
public SmartWebClient(int maxConcurentConnectionCount = 20)
{
this.Proxy = null;
this.Encoding = Encoding.GetEncoding("UTF-8");
this.maxConcurentConnectionCount = maxConcurentConnectionCount;
}
protected override WebRequest GetWebRequest(Uri address)
{
var httpWebRequest = (HttpWebRequest)base.GetWebRequest(address);
if (httpWebRequest == null)
{
return null;
}
if (maxConcurentConnectionCount != 0)
{
httpWebRequest.ServicePoint.ConnectionLimit = maxConcurentConnectionCount;
}
return httpWebRequest;
}
}
}
This allows me to take advantage of good bandwidth,
only I am far from the subject solution. I would really appreciate any clue on where to start.
If the server supports what Wikipedia calls byte serving, you can multiplex a file download, spawning multiple requests with a specific Range header value (using the AddRange method; see also How to download the data from the server discontinuously?). Most serious HTTP servers do support byte ranges.
Here is some sample code that implements a parallel download of a file using byte range:
public static void ParallelDownloadFile(string uri, string filePath, int chunkSize)
{
if (uri == null)
throw new ArgumentNullException("uri");
// determine file size first
long size = GetFileSize(uri);
using (FileStream file = new FileStream(filePath, FileMode.Create, FileAccess.Write, FileShare.Write))
{
file.SetLength(size); // set the length first
object syncObject = new object(); // synchronize file writes
Parallel.ForEach(LongRange(0, 1 + size / chunkSize), (start) =>
{
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
request.AddRange(start * chunkSize, start * chunkSize + chunkSize - 1);
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
lock (syncObject)
{
using (Stream stream = response.GetResponseStream())
{
file.Seek(start * chunkSize, SeekOrigin.Begin);
stream.CopyTo(file);
}
}
});
}
}
public static long GetFileSize(string uri)
{
if (uri == null)
throw new ArgumentNullException("uri");
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
request.Method = "HEAD";
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
return response.ContentLength;
}
private static IEnumerable<long> LongRange(long start, long count)
{
long i = 0;
while (true)
{
if (i >= count)
{
yield break;
}
yield return start + i;
i++;
}
}
And sample usage:
private static void TestParallelDownload()
{
string uri = "http://localhost/welcome.png";
string fileName = Path.GetFileName(uri);
ParallelDownloadFile(uri, fileName, 10000);
}
PS: I'd be curious to know if it's really more interesting to do this parallel thing rather than to just use WebClient.DownloadFile... Maybe in slow network scenarios?
I am working on an app that searches for email addresses in Google search results' URLs. The problem is it needs to return the value it found in each page + the URL in which it found the email, to a datagridview with 2 columns: Email and URL.
I am using Parallel.ForEach for this one but of course it returns random URLs and not the ones it really found the email on.
public static string htmlcon; //htmlsource
public static List<string> emailList = new List<string>();
public static string Get(string url, bool proxy)
{
htmlcon = "";
try
{
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
if (proxy)
req.Proxy = new WebProxy(proxyIP + ":" + proxyPort);
req.Method = "GET";
req.UserAgent = Settings1.Default.UserAgent;
if (Settings1.Default.EnableCookies == true)
{
CookieContainer cont = new CookieContainer();
req.CookieContainer = cont;
}
WebResponse resp = req.GetResponse();
StreamReader SR = new StreamReader(resp.GetResponseStream());
htmlcon = SR.ReadToEnd();
Thread.Sleep(400);
resp.Close();
SR.Close();
}
catch (Exception)
{
Thread.Sleep(500);
}
return htmlcon;
}
private void copyMails(string url)
{
string emailPat = @"(\b[a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b)";
MatchCollection mailcol = Regex.Matches(htmlcon, emailPat, RegexOptions.Singleline);
foreach (Match mailMatch in mailcol)
{
email = mailMatch.Groups[1].Value;
if (!emailList.Contains(email))
{
emailList.Add(email);
Action dgeins = () => mailDataGrid.Rows.Insert(0, email, url);
mailDataGrid.BeginInvoke(dgeins);
}
}
}
private void SEbgWorker_DoWork(object sender, DoWorkEventArgs e)
{
//A LOT OF IRRELEVANT STUFF BEING RUN
Parallel.ForEach(allSElist.OfType<string>(), (s) =>
{
//Get URL
Get(s, Settings1.Default.Proxyset);
//match mails 1st page
copyMails(s);
});
}
So this is it: I execute a Get request (where "s" is the URL from the list) and then execute copyMails(s) on the URL's HTML source. It uses regex to copy the emails.
If I do it without Parallel it returns the correct URL for each email in the DataGridView. How can I do this in parallel and still get the correct match in the DataGridView?
Thanks
You would be better off using PLINQ's Where to filter (pseudo code):
var results = from i in input.AsParallel()
let u = get the URL from i
let d = get the data from u
let v = try get the value from d
where v is found
select new {
Url = u,
Value = v
};
Underneath, the AsParallel means that TPL's implementation of the LINQ operators (Select, Where, ...) is used.
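As a rough, concrete sketch of that query shape (assuming the Get and ExtractEmails helpers shown later in this answer, and the allSElist source from the question):
// Each URL is paired with the emails found on it, so nothing gets mismatched.
var results = allSElist.OfType<string>()
    .AsParallel()
    .Select(url => new { Url = url, Html = Get(url, Settings1.Default.Proxyset) })
    .SelectMany(x => ExtractEmails(x.Html)
                     .Select(email => new { x.Url, Email = email }))
    .ToList();
// results can then be inserted into mailDataGrid on the UI thread.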
UPDATE: Now with more information
First there are a number of issues in your code:
The variable htmlcon is static but used directly by multiple threads. This could well be your underlying problem. Consider just two input values: the first Get completes, setting htmlcon; before that thread's call to copyMails starts, the second thread's Get completes its HTML GET and writes to htmlcon. The emails are then extracted from the second page's content but recorded against the first URL.
The list emailList is also accessed without locking by multiple threads. Most collection types in .NET (and any other programming platform) are not thread safe, you need to limit access to a single thread at a time.
You are mixing up various activities in each of your methods. Consider applying the single responsibility principle.
Thread.Sleep to handle an exception?! If you can't handle an exception (i.e. resolve the condition) then do nothing. In this case if the action throws then the Parallel.ForEach will throw: that'll do until you define how to handle the HTML GET failing.
Three suggestions:
In my experience clean code (to an obsessive degree) makes things easier: the details of the format
don't matter (one true brace style is better, but consistency is the key). Just going through
and cleaning up the formatting showed up issues #1 and #2.
Good naming. Don't abbreviate anything used over more than a few lines of code unless that is a
significant term for the domain. Eg. s for the action parameter in the parallel loop is really a url
so call it that. This kind of thing immediately makes the code easier to follow.
Think about that regex for emails: there are many valid emails that will not match (e.g. use of + to provide multiple logical addresses: example+one@gmail.com will be delivered to example@gmail.com and can then be used for local rules). Also an apostrophe ("'") is a valid character (and I have known people frustrated by web sites that refused their addresses by getting this wrong).
Second: A relatively direct clean up:
public static string Get(string url, bool proxy) {
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
if (proxy) {
req.Proxy = new WebProxy(proxyIP + ":" + proxyPort);
}
req.Method = "GET";
req.UserAgent = Settings1.Default.UserAgent;
if (Settings1.Default.EnableCookies == true) {
CookieContainer cont = new CookieContainer();
req.CookieContainer = cont;
}
using (WebResponse resp = req.GetResponse())
using (StreamReader SR = new StreamReader(resp.GetResponseStream())) {
return SR.ReadToEnd();
}
}
private static Regex emailMatcher = new Regex(@"(\b[a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b)", RegexOptions.Singleline);
private static string[] ExtractEmails(string htmlContent) {
return emailMatcher.Matches(htmlContent).OfType<Match>()
.Select(m => m.Groups[1].Value)
.Distinct()
.ToArray();
}
private void SEbgWorker_DoWork(object sender, DoWorkEventArgs e) {
Parallel.ForEach(allSElist.OfType<string>(), url => {
var htmlContent = Get(url, Settings1.Default.Proxyset);
var emails = ExtractEmails(htmlContent);
foreach (var email in emails) {
Action dgeins = () => mailDataGrid.Rows.Insert(0, email, url);
mailDataGrid.BeginInvoke(dgeins);
}
});
}
Here I have:
Made use of using statements to automate the cleanup of resources.
Eliminated all mutable shared state.
Regex is explicitly documented to have thread safe instance methods. So I only need a single instance.
Removed noise: no need to pass the URL to ExtractEmails because the extraction doesn't use the URL.
Get now only performs the HTML GET; ExtractEmails does just the extraction.
Third: The above will block threads on the slowest operation: the HTML GET.
The real concurrency benefit would be to replace HttpWebRequest.GetResponse and reading the response stream with their asynchronous equivalents.
Using Task would be the answer in .NET 4, but you need to directly work with Stream and encoding yourself because StreamReader doesn't provide any BeginABC/EndABC method pairs. But .NET 4.5 is almost here, so apply some async/await:
Nothing to do in ExtractEmails.
Get is now asynchronous, blocking in neither the HTTP GET or reading the result.
SEbgWorker_DoWork uses Tasks directly to avoid mixing too many different ways to work with TPL. Since Get returns a Task<string> it can simply continue (note that unless you specify a TaskContinuationOptions value, ContinueWith will also run if the previous task faulted, so a production version should check the antecedent's status):
This should work in .NET 4.5, but without a set of valid URLs for which this will work I cannot test.
public static async Task<string> Get(string url, bool proxy) {
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
if (proxy) {
req.Proxy = new WebProxy(proxyIP + ":" + proxyPort);
}
req.Method = "GET";
req.UserAgent = Settings1.Default.UserAgent;
if (Settings1.Default.EnableCookies == true) {
CookieContainer cont = new CookieContainer();
req.CookieContainer = cont;
}
using (WebResponse resp = await req.GetResponseAsync())
using (StreamReader SR = new StreamReader(resp.GetResponseStream())) {
return await SR.ReadToEndAsync();
}
}
private static Regex emailMatcher = new Regex(@"(\b[a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b)", RegexOptions.Singleline);
private static string[] ExtractEmails(string htmlContent) {
return emailMatcher.Matches(htmlContent).OfType<Match>()
.Select(m => m.Groups[1].Value)
.Distinct()
.ToArray();
}
private void SEbgWorker_DoWork(object sender, DoWorkEventArgs e) {
var tasks = allSElist.OfType<string>()
.Select(url => {
return Get(url, Settings1.Default.Proxyset)
.ContinueWith(htmlContentTask => {
// Note: without TaskContinuationOptions the continuation also runs if Get faulted,
// so production code should check htmlContentTask.Status before using Result.
var htmlContent = htmlContentTask.Result;
var emails = ExtractEmails(htmlContent);
foreach (var email in emails) {
// No InvokeAsync on WinForms, so do this the old way.
Action dgeins = () => mailDataGrid.Rows.Insert(0, email, url);
mailDataGrid.BeginInvoke(dgeins);
}
});
});
Task.WaitAll(tasks.ToArray());
}
public static string htmlcon; //htmlsource
public static List<string> emailList = new List<string>();
The problem is that these members, htmlcon and emailList, are shared resources among threads and among iterations. Each of your iterations in Parallel.ForEach is executed in parallel. That's why you get strange behaviour.
How to solve the problem:
Modify your code and try to implement it without static variables or shared state.
One option is to change from Parallel.ForEach to TPL task chaining; when you make this change, the result of one parallel operation becomes the input data for the next one. This is one of many ways to modify the code to avoid shared state.
Use locking or concurrent collections. Your htmlcon variable could be made volatile, but for the list you should use locks or concurrent collections.
The better way is to modify your code to avoid shared state; there are many options for how to do that depending on your implementation, not only task chaining. See the sketch below.
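As a minimal sketch of that direction (assuming Get is changed to work with a local string rather than the static htmlcon field, and reusing the question's emailPat pattern):
// Each iteration uses only locals; results go into a thread-safe dictionary
// keyed by URL, so every email list stays paired with the URL it came from.
var emailsByUrl = new ConcurrentDictionary<string, string[]>();
Parallel.ForEach(allSElist.OfType<string>(), url =>
{
    string html = Get(url, Settings1.Default.Proxyset);   // local, not the static field
    string[] emails = Regex.Matches(html, emailPat, RegexOptions.Singleline)
                           .OfType<Match>()
                           .Select(m => m.Groups[1].Value)
                           .Distinct()
                           .ToArray();
    emailsByUrl.TryAdd(url, emails);
});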
Suppose I have the following class:
public class FooBar
{
List<Items> _items = new List<Items>();
public List<Items> FetchItems(int parentItemId)
{
FetchSingleItem(parentItemId);
return _items;
}
private void FetchSingleItem(int itemId)
{
Uri url = new Uri(String.Format("http://SomeURL/{0}.xml", itemId));
HttpWebRequest webRequest = (HttpWebRequest)HttpWebRequest.Create(url);
webRequest.BeginGetResponse(ReceiveResponseCallback, webRequest);
}
void ReceiveResponseCallback(IAsyncResult result)
{
// End the call and extract the XML from the response and add item to list
_items.Add(itemFromXMLResponse);
// If this item is linked to another item then fetch that item
if (anotherItemIdExists == true)
{
FetchSingleItem(anotherItemId);
}
}
}
There could be any number of linked items that I will only know about at runtime.
What I want to do is make the initial call to FetchSingleItem and then wait until all calls have completed, then return the List<Items> to the calling code.
Could someone point me in the right direction? I am more than happy to refactor the whole thing if need be (which I suspect will be the case!)
Getting the hang of asynchronous coding is not easy, especially when there is some sequential dependency between one operation and the next. This is the exact sort of problem that I wrote the AsyncOperationService to handle; it's a cunningly short bit of code.
First a little light reading for you: Simple Asynchronous Operation Runner – Part 2. By all means read part 1, but it's a bit heavier than I had intended. All you really need is the AsyncOperationService code from it.
Now in your case you would convert your fetch code to something like the following.
private IEnumerable<AsyncOperation> FetchItems(int startId)
{
XDocument itemDoc = null;
int currentId = startId;
while (currentId != 0)
{
yield return DownloadString(new Uri(String.Format("http://SomeURL/{0}.xml", currentId), UriKind.Absolute),
itemXml => itemDoc = XDocument.Parse(itemXml) );
// Do stuff with itemDoc like creating your item and placing it in the list.
// Assign the next linked ID to currentId or if no other items assign 0
}
}
Note that the blog also has an implementation of DownloadString, which in turn uses WebClient, which simplifies things. However, the principles still apply if for some reason you must stick with HttpWebRequest. (Let me know if you are having trouble creating an AsyncOperation for this.)
You would then use this code like this:-
int startId = GetSomeIDToStartWith();
FooBar myFoo = new FooBar();
myFoo.FetchItems(startId).Run((err) =>
{
// Clear IsBusy
if (err == null)
{
// All items are now fetched continue doing stuff here.
}
else
{
// "Oops something bad happened" code here
}
});
// Set IsBusy
Note that the call to Run is asynchronous; code execution will appear to jump past it before all the items are fetched. If the UI is useless to the user, or even dangerous, then you need to block it in a friendly way. The best way (IMO) to do this is with the BusyIndicator control from the toolkit, setting its IsBusy property after the call to Run and clearing it in the Run callback.
All you need is a thread sync thingy. I chose ManualResetEvent.
However, I don't see the point of using asynchronous IO since you always wait for the request to finish before starting a new one. But the example might not show the whole story?
public class FooBar
{
private ManualResetEvent _completedEvent = new ManualResetEvent(false);
List<Items> _items = new List<Items>();
public List<Items> FetchItems(int parentItemId)
{
FetchSingleItem(parentItemId);
_completedEvent.WaitOne();
return _items;
}
private void FetchSingleItem(int itemId)
{
Uri url = new Uri(String.Format("http://SomeURL/{0}.xml", itemId));
HttpWebRequest webRequest = (HttpWebRequest)HttpWebRequest.Create(url);
webRequest.BeginGetResponse(ReceiveResponseCallback, webRequest);
}
void ReceiveResponseCallback(IAsyncResult result)
{
// End the call and extract the XML from the response and add item to list
_items.Add(itemFromXMLResponse);
// If this item is linked to another item then fetch that item
if (anotherItemIdExists == true)
{
FetchSingleItem(anotherItemId);
}
else
_completedEvent.Set();
}
}