Web crawler times out - C#

I am working on a simple web crawler: get a URL, crawl the first-level links on the site, and extract emails from all pages using a regex...
I know it's kind of sloppy and it's just the beginning, but I always get "Operation Timed Out" after 2 minutes of running the script..
private void button1_Click(object sender, System.EventArgs e)
{
    string url = textBox1.Text;
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    HttpWebResponse response = (HttpWebResponse)request.GetResponse();
    StreamReader sr = new StreamReader(response.GetResponseStream());
    string code = sr.ReadToEnd();
    string re = "href=\"(.*?)\"";
    MatchCollection href = Regex.Matches(code, re, RegexOptions.Singleline);
    foreach (Match h in href)
    {
        string link = h.Groups[1].Value;
        if (!link.Contains("http://"))
        {
            HttpWebRequest request2 = (HttpWebRequest)WebRequest.Create(url + link);
            HttpWebResponse response2 = (HttpWebResponse)request2.GetResponse();
            // Bug in the original: it read from response/sr again instead of
            // response2/sr2, and never disposed the inner responses, which
            // exhausts the connection pool (two connections per host by
            // default) and produces the "Operation Timed Out" error.
            using (StreamReader sr2 = new StreamReader(response2.GetResponseStream()))
            {
                string innerlink = sr2.ReadToEnd();
                MatchCollection m2 = Regex.Matches(innerlink, @"([\w-]+(\.[\w-]+)*@([a-z0-9-]+(\.[a-z0-9-]+)*?\.[a-z]{2,6}|(\d{1,3}\.){3}\d{1,3})(:\d{4})?)", RegexOptions.Singleline);
                foreach (Match m in m2)
                {
                    string email = m.Groups[1].Value;
                    if (!listBox1.Items.Contains(email))
                    {
                        listBox1.Items.Add(email);
                    }
                }
            }
            response2.Close();
        }
    }
    sr.Close();
    response.Close();
}

Don't parse Html using Regex. Use the Html Agility Pack for that.
What is exactly the Html Agility Pack (HAP)?
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what System.Xml proposes, but for HTML documents (or streams).
More Information
Html Agility Pack on Codeplex
How to use HTML Agility pack
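As a rough sketch of what the crawler could look like with the Agility Pack (the URL and the email pattern here are illustrative choices, not part of the original question):

```csharp
// Sketch of the crawler using HtmlAgilityPack instead of regex-parsed HTML.
// The URL and the email pattern below are placeholders, not the asker's.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

class CrawlerSketch
{
    // Pure helper: pull email-like strings out of a chunk of text.
    public static List<string> ExtractEmails(string text) =>
        Regex.Matches(text, @"[\w.+-]+@[\w-]+(\.[\w-]+)+")
             .Cast<Match>()
             .Select(m => m.Value)
             .Distinct()
             .ToList();

    static void Main()
    {
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load("http://example.com");

        // First-level links; SelectNodes returns null when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors != null)
            foreach (HtmlNode a in anchors)
                Console.WriteLine(a.GetAttributeValue("href", ""));

        // Search the rendered text, not the raw markup, for addresses.
        foreach (string email in ExtractEmails(doc.DocumentNode.InnerText))
            Console.WriteLine(email);
    }
}
```

This keeps the email regex (a reasonable place for regex) but leaves the link extraction to a real HTML parser.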

The comment by Oded is correct, we need to know what you need help with specifically; however, I can at least point you to the Html Agility Pack, as it will solve most of your web scraping woes.
Good Luck!
Matt

Related

Search for specific text in html response (ASP.NET Core)

I need to search for a specific word in an HTML web page.
I am trying to do this in C# (ASP.NET Core).
My plan is to get the URL and the search word from the view via JS,
and then, in the response, show the word if it exists, or show something else if it doesn't.
I made a method for getting the HTML code of the page. Here is the code:
[HttpPost]
public JsonResult SearchWord([FromBody] RequestModel model)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(model.adress);
    HttpWebResponse response = (HttpWebResponse)request.GetResponse();
    Stream receiveStream = response.GetResponseStream();
    StreamReader readStream = null;
    if (response.CharacterSet == null)
    {
        readStream = new StreamReader(receiveStream);
    }
    else
    {
        readStream = new StreamReader(receiveStream, Encoding.GetEncoding(response.CharacterSet));
    }
    string data = readStream.ReadToEnd();
    string strRegex = model.word;
    response.Close();
    readStream.Close();
    return Json(data);
}
But, how I need to search for word correctly?
To add to the previous answer, you can use Regex.Match. Something like:
// Build the pattern from the search word; escape it in case it contains regex metacharacters.
string pattern = @"(\w+)\s+(" + Regex.Escape(strRegex) + ")";
// Instantiate the regular expression object.
Regex r = new Regex(pattern, RegexOptions.IgnoreCase);
// Match the regular expression pattern against your html data.
Match m = r.Match(data);
if (m.Success)
{
    // Add your logic here
}
NOTE: There are quite a few things you can do to optimize your code, particularly around how you handle the stream reader. You could read in chunks and try to match each chunk.
You will not be able to do much with simple pattern matching; check out this SO classic: RegEx match open tags except XHTML self-contained tags. Consider using a web-scraping library like Html Agility Pack if you want to do some serious scraping. If you only want to search for a single word in a web page, no matter whether it appears in markup or CDATA etc., just use string.Contains or Regex on the downloaded text.
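For the simple single-word case mentioned above, a minimal sketch could look like this (HttpClient and the URL/word are illustrative choices, not from the question):

```csharp
// Minimal sketch of the "just use string.Contains" approach.
// HttpClient and the URL/word below are placeholders, not the asker's code.
using System;
using System.Net.Http;

class WordSearchSketch
{
    // Case-insensitive containment check on the downloaded markup.
    public static bool PageContains(string html, string word) =>
        html.IndexOf(word, StringComparison.OrdinalIgnoreCase) >= 0;

    static void Main()
    {
        using var client = new HttpClient();
        string html = client.GetStringAsync("http://example.com").GetAwaiter().GetResult();
        Console.WriteLine(PageContains(html, "example") ? "found" : "not found");
    }
}
```

Note this also matches the word inside tags and attributes; if that matters, strip the markup with a parser first.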

how to remove htmldocument.cs not found error in html agility pack

class Response:
public string WebResponse(string url) // class through which I'll pass the link of a website and parse some divs in a method of this class
{
    string html = string.Empty;
    try
    {
        HtmlDocument doc = new HtmlDocument(); // when execution reaches here it gives an "htmldocument.cs not found" error and opens a window for browsing source
        WebClient client = new WebClient();    // even if I put HtmlWeb there, it still looks for HtmlWeb.cs, not found
        html = client.DownloadString(url);     // is this from some breakpoint error? I set only one, in the method where I am parsing
        doc.LoadHtml(html);
    }
    catch (Exception)
    {
        html = string.Empty;
    }
    return html; // please help me remove this error using Html Agility Pack with a console application
}
Even if I make a new project and run the code, it still gets stuck here, and I have added the DLL too. It is still giving me this error; please help me remove it.
First of all, WebResponse is the name of an abstract class in System.Net, so avoid reusing it as a method name. Second, in order to use WebResponse as a base type, a class has to inherit from WebResponse, i.e.
public class WR : WebResponse
{
//Code
}
Also, your current code has nothing to do with Html Agility Pack. If you want to load the HTML of a webpage into an HtmlDocument, do the following:
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
try
{
    var temp = new Uri(url);
    var request = (HttpWebRequest)WebRequest.Create(temp);
    request.Method = "GET";
    using (var response = (HttpWebResponse)request.GetResponse())
    {
        using (var stream = response.GetResponseStream())
        {
            htmlDoc.Load(stream, Encoding.GetEncoding("iso-8859-9"));
        }
    }
}
catch (WebException ex)
{
    Console.WriteLine(ex.Message);
}
Then, in order to get nodes in the HtmlDocument, you have to use XPath, like so:
HtmlNode node = htmlDoc.DocumentNode.SelectSingleNode("//body");
Console.WriteLine(node.InnerText);
That error is sometimes caused by the version of the Html Agility Pack NuGet package you are using. Update NuGet from the Visual Studio gallery, then try reinstalling Html Agility Pack and running your project.
You can also try cleaning and re-building the solution; this may fix the issue.

Extracting url links within a downloaded txt file

I'm currently working on a URL extractor for work. I'm trying to extract all the http/href links from a downloaded HTML file and print the links on their own in a separate txt file. So far I've managed to get the entire HTML of a page downloaded; extracting the links from it and printing them using Regex is the problem. I'm wondering if anyone could help me with this?
private void button2_Click(object sender, EventArgs e)
{
    Uri fileURI = new Uri(URLbox2.Text);

    WebRequest request = WebRequest.Create(fileURI);
    request.Credentials = CredentialCache.DefaultCredentials;
    WebResponse response = request.GetResponse();
    Console.WriteLine(((HttpWebResponse)response).StatusDescription);

    Stream dataStream = response.GetResponseStream();
    StreamReader reader = new StreamReader(dataStream);
    string responseFromServer = reader.ReadToEnd();

    SW = File.CreateText(@"C:\Users\Conal_Curran\OneDrive\C#\MyProjects\Web Crawler\URLTester\response1.htm");
    SW.WriteLine(responseFromServer);
    SW.Close();

    string text = System.IO.File.ReadAllText(@"C:\Users\Conal_Curran\OneDrive\C#\MyProjects\Web Crawler\URLTester\response1.htm");

    // The Regex constructor takes the pattern, not the input lines.
    Regex regx = new Regex(@"http://([\w+?\.\w+])+([a-zA-Z0-9\~\!\@\#\$\%\^\&\*\(\)_\-\=\+\\\/\?\.\:\;\'\,]*)?", RegexOptions.IgnoreCase);
    MatchCollection matches = regx.Matches(text);

    SW = File.CreateText(@"C:\Users\Conal_Curran\OneDrive\C#\MyProjects\Web Crawler\URLTester\Links.htm");
    foreach (Match match in matches)
    {
        // Write each extracted link on its own line, rather than the array's type name.
        SW.WriteLine(match.Value);
    }
    SW.Close();
}
In case you do not know, this can be achieved (pretty easily) using one of the HTML parser NuGet packages available.
I personally use HtmlAgilityPack (along with ScrapySharp, another package) and AngleSharp.
With only the few lines below, you have all the hrefs in the document, using HtmlAgilityPack:
/*
do not forget to include the usings:
using HtmlAgilityPack;
using ScrapySharp.Extensions;
*/
//since you have your html locally stored, load it from the file.
//P.S: by prefixing file path strings with @, you avoid having to escape slashes and other fluff.
var doc = new HtmlDocument();
doc.Load(@"C:\Users\Conal_Curran\OneDrive\C#\MyProjects\Web Crawler\URLTester\response1.htm");
//for an http get request instead:
//var w = new HtmlWeb();
//var doc = w.Load("yourAddressHere");
var hrefs = doc.DocumentNode.CssSelect("a").Select(a => a.GetAttributeValue("href"));

Read specific div from HttpResponse

I am sending one HttpWebRequest and reading the response.
I am getting the full page in the response.
I want to get one div, named rate, from the response.
So how can I match that pattern?
My code is like:
HttpWebRequest WebReq = (HttpWebRequest)WebRequest.Create("http://www.domain.com/");
HttpWebResponse WebResp = (HttpWebResponse)WebReq.GetResponse();
Stream response = WebResp.GetResponseStream();
StreamReader data = new StreamReader(response);
string result = data.ReadToEnd();
I am getting response like:
<HTML><BODY><div id="rate">Todays rate 55 Rs.</div></BODY></HTML>
I want to read the data of the div rate, i.e. I should get the content "Todays rate 55 Rs."
So how can I write a regex for this?
The HTML Agility Pack can load and parse the file for you, no need for messy streams and responses:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://jsbin.com/owobe3");
HtmlNode rateNode = doc.DocumentNode.SelectSingleNode("//div[@id='rate']");
string rate = rateNode.InnerText;
You should read the entire response and then use something like the Html Agility Pack to parse the response and extract the bits you want in an xpath-like syntax:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(result);
var output = doc.DocumentNode.SelectSingleNode("//div[@id='rate']").InnerHtml;
Don't use regular expressions!
If you have only one "Todays rate" text then you can do it like this:
Todays rate \d+ Rs.
Otherwise you can add the div tag to your regex.
Edit: sorry, I don't have regex installed locally.
You need to use grouping and get the value from the group. It will look like this:
<div id="rate">(?<group>[^<]*)</div>
Not sure if it works as-is, but use this idea.
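Against the sample response from the question, a named-group pattern like the one suggested can be checked like this (note the `*` quantifier on `[^<]*`, needed so the group captures the whole text rather than one character):

```csharp
// Quick check of the grouped pattern against the sample response from the question.
using System;
using System.Text.RegularExpressions;

class RateRegexDemo
{
    // Returns the text inside <div id="rate">...</div>, or null if absent.
    public static string ExtractRate(string html)
    {
        Match m = Regex.Match(html, "<div id=\"rate\">(?<group>[^<]*)</div>");
        return m.Success ? m.Groups["group"].Value : null;
    }

    static void Main()
    {
        string result = "<HTML><BODY><div id=\"rate\">Todays rate 55 Rs.</div></BODY></HTML>";
        Console.WriteLine(ExtractRate(result)); // Todays rate 55 Rs.
    }
}
```

This only works while the markup stays exactly in this shape; any attribute change breaks it, which is why the other answers point at the Html Agility Pack.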

Extract data webpage

Folks,
I'm trying to extract data from a web page using C#. For the moment I use the Stream from the WebResponse and parse it as one big string. It's long and painful. Does anyone know a better way to extract data from a web page? I've seen WinHTTP, but it isn't for C#.
To download data from a web page it is easier to use WebClient:
string data;
using (var client = new WebClient())
{
    data = client.DownloadString("http://www.google.com");
}
For parsing downloaded data, provided that it is HTML, you could use the excellent Html Agility Pack library.
And here's a complete example extracting all the links from a given page:
class Program
{
    static void Main(string[] args)
    {
        using (var client = new WebClient())
        {
            string data = client.DownloadString("http://www.google.com");
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(data);
            var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
            foreach (HtmlNode link in nodes)
            {
                HtmlAttribute att = link.Attributes["href"];
                Console.WriteLine(att.Value);
            }
        }
    }
}
If the webpage is valid XHTML, you can read it into an XPathDocument and xpath your way quickly and easily straight to the data you want. If it's not valid XHTML, I'm sure there are some HTML parsers out there you can use.
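A sketch of the XPathDocument route mentioned above, assuming the page really is well-formed XHTML (the inline markup here is a stand-in for a downloaded page):

```csharp
// Sketch: query well-formed XHTML with XPathDocument from System.Xml.XPath.
using System;
using System.IO;
using System.Xml.XPath;

class XPathSketch
{
    // Returns the text value of the first node matching the XPath, or null.
    public static string SelectValue(string xhtml, string xpath)
    {
        var doc = new XPathDocument(new StringReader(xhtml));
        XPathNavigator node = doc.CreateNavigator().SelectSingleNode(xpath);
        return node?.Value;
    }

    static void Main()
    {
        string xhtml = "<html><body><div id=\"rate\">Todays rate 55 Rs.</div></body></html>";
        Console.WriteLine(SelectValue(xhtml, "//div[@id='rate']")); // Todays rate 55 Rs.
    }
}
```

XPathDocument throws an XmlException on malformed input, which is exactly why a tolerant parser is needed for real-world HTML.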
Found a similar question with an answer that should help.
Looking for C# HTML parser
