C# URL Crawler not getting enough links? - c#

I have the following code; however, when I launch it I only ever seem to get a few URLs returned.
while (stopFlag != true)
{
    WebRequest request = WebRequest.Create(urlList[i]);
    using (WebResponse response = request.GetResponse())
    {
        using (StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8))
        {
            string sitecontent = reader.ReadToEnd();

            //add links to the list
            // process the content
            //clear the text box ready for the HTML code
            //Regex urlRx = new Regex(@"((https?|ftp|file)\://|www.)[A-Za-z0-9\.\-]+(/[A-Za-z0-9\?\&\=;\+!'\(\)\*\-\._~%]*)*", RegexOptions.IgnoreCase);
            Regex urlRx = new Regex(@"(?<url>(http:[/][/]|www.)([a-z]|[A-Z]|[0-9]|[/.]|[~])*)", RegexOptions.IgnoreCase);
            MatchCollection matches = urlRx.Matches(sitecontent);
            foreach (Match match in matches)
            {
                string cleanMatch = cleanUP(match.Value);
                urlList.Add(cleanMatch);
                updateResults(theResults, "\"" + cleanMatch + "\",\n");
            }
        }
    }
}
I think the error is in the regex.
What I am trying to achieve is to pull a webpage, grab all the links from that page, add these to a list, then for each list item fetch the next page and repeat the process.

Instead of trying to use regex to parse HTML, I suggest using a good HTML parser - the HTML Agility Pack is a great choice:
What is exactly the Html Agility Pack (HAP)?
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what System.Xml proposes, but for HTML documents (or streams).
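A minimal sketch of what the crawl loop could look like with the Agility Pack instead of a regex (urlList, stopFlag, cleanUP-style helpers, updateResults and theResults are carried over from the question; everything else is standard Agility Pack usage):
// using HtmlAgilityPack;  -- install the HtmlAgilityPack NuGet package
var web = new HtmlWeb();
int i = 0;
while (!stopFlag && i < urlList.Count)
{
    HtmlDocument doc = web.Load(urlList[i]);

    // select every anchor that actually carries an href attribute
    var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
    if (anchors != null)
    {
        foreach (var anchor in anchors)
        {
            string href = anchor.GetAttributeValue("href", string.Empty);
            if (href.StartsWith("http")) // this sketch only follows absolute links
            {
                urlList.Add(href);
                updateResults(theResults, "\"" + href + "\",\n");
            }
        }
    }
    i++; // move on to the next page in the list
}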

Related

How to extract a specific line from a webpage in c#

HttpWebRequest myReq = (HttpWebRequest)WebRequest.Create("https://www.google.com/search?q=" + "msg");
HttpWebResponse myres = (HttpWebResponse)myReq.GetResponse();
using (StreamReader sr = new StreamReader(myres.GetResponseStream()))
{
    pageContent = sr.ReadToEnd();
}
if (pageContent.Contains("find"))
{
    display = "done";
}
Currently this code checks whether "find" exists on a URL and displays "done" if it is present.
What I want is to display the whole line or paragraph which contains "find".
So instead of display = "done", I want to store the line which contains "find" in display.
HTML pages don't have lines. Whitespace outside tags is ignored, and an entire minified page may have no newlines at all. Even if it did, newlines are ignored even inside tags; that's why <br> is necessary. If you want to find a specific element you'll have to use an HTML parser like HtmlAgilityPack and identify the element using an XPath or CSS selector expression.
Copying from the landing page examples:
var url = $"https://www.google.com/search?q={msg}";
var web = new HtmlWeb();
var doc = web.Load(url);
var value = doc.DocumentNode
    .SelectNodes("//div[@id='center_col']")
    .First()
    .Attributes["value"].Value;
What you put in SelectNodes depends on what you want to find.
One way to test various expressions is to open the web page you want in a browser, open the browser's Developer Tools and start searching in the Element inspector. The search functionality there accepts XPath and CSS selectors.
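As a rough sketch of the original goal (storing the text of whatever element contains "find"), the XPath contains() function can do the lookup; the node and display variables below are only illustrative:
// minimal sketch, assuming HtmlAgilityPack (HtmlWeb/HtmlDocument) as above
var web = new HtmlWeb();
var doc = web.Load($"https://www.google.com/search?q={msg}");

// first element whose own text contains the word "find"
var node = doc.DocumentNode.SelectSingleNode("//*[contains(text(), 'find')]");
string display = node != null ? node.InnerText.Trim() : "not found";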

get html elements of <tr>....</tr> by creating a httpwebrequest in c#

I want to extract some HTML elements from the "table row" contents of an HTML page and create an automated application. Can HttpWebRequest and HttpWebResponse help me do that? If yes,
could anyone show me a sample of doing that? Thanking you in advance.
I would go get HtmlAgilityPack from NuGet. WebClient is easier, but HttpWebRequest is more powerful and allows for more control. Regex can work, but is generally a pain. If you think this document will be well enough formatted, a quick XPath to the elements in question is usually much easier and cleaner, so try something like this:
var client = new WebClient();
//var html = client.DownloadString("YOURURL");
var html = "<html><body><table><tr><td></td></tr></table></body></html>";
var document = new HtmlDocument();
document.LoadHtml(html);
var nodes = document.DocumentNode.SelectNodes("//body/table");
Console.WriteLine(nodes[0].InnerHtml);
Console.ReadLine();
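To pull out the <tr> contents themselves rather than the whole table, a sketch along the same lines could select the rows and cells directly (XPath and property names as in HtmlAgilityPack; the document variable is the one created above):
// rows of every table in the page; HtmlAgilityPack does not insert an implied <tbody>
var rows = document.DocumentNode.SelectNodes("//table/tr");
if (rows != null)
{
    foreach (var row in rows)
    {
        var cells = row.SelectNodes("td"); // the <td> cells of the current row
        if (cells != null)
        {
            foreach (var cell in cells)
            {
                Console.WriteLine(cell.InnerText.Trim());
            }
        }
    }
}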

Grabbing content from a website in C#

New to C# here, but I've used Java for years. I tried googling this and got a couple of answers that were not quite what I need. I'd like to grab the (X)HTML from a website and then use DOM (actually, CSS selectors are preferable, but whatever works) to grab a particular element. How exactly is this done in C#?
To get the HTML you can use the WebClient object.
To parse the HTML you can use the HtmlAgilityPack library.
// prepare the web page we will be asking for
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.stackoverflow.com");

// execute the request
HttpWebResponse response = (HttpWebResponse)request.GetResponse();

// we will read data via the response stream
Stream resStream = response.GetResponseStream();

byte[] buf = new byte[8192];            // read buffer
StringBuilder sb = new StringBuilder(); // accumulates the page text
string tempString = null;
int count = 0;
do
{
    // fill the buffer with data
    count = resStream.Read(buf, 0, buf.Length);

    // make sure we read some data
    if (count != 0)
    {
        // translate from bytes to ASCII text
        tempString = Encoding.ASCII.GetString(buf, 0, count);

        // continue building the string
        sb.Append(tempString);
    }
} while (count > 0); // any more data to read?
Then use XQuery expressions or Regex to grab the element you need.
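For instance, a minimal sketch of that second step with HtmlAgilityPack (sb is the StringBuilder filled above; selecting the <title> element is just an arbitrary example):
var htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.LoadHtml(sb.ToString());

// grab the <title> element as a simple example of selecting one node with XPath
var titleNode = htmlDoc.DocumentNode.SelectSingleNode("//title");
if (titleNode != null)
{
    Console.WriteLine(titleNode.InnerText);
}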
You could use System.Net.WebClient or System.Net.HttpWebRequest to fetch the page, but parsing out the elements is not supported by those classes.
Use HtmlAgilityPack (http://html-agility-pack.net/)
HtmlWeb htmlWeb = new HtmlWeb();
htmlWeb.UseCookies = true;
HtmlDocument htmlDocument = htmlWeb.Load(url);

// after getting the document node
// you can do something like this
foreach (HtmlNode item in htmlDocument.DocumentNode.Descendants("input"))
{
    // item matches your requirement
    // take the item
}
I hear you want to use the HtmlAgilityPack for working with HTML files. This will give you Linq access, which is A Good Thing (tm). You can download the file with System.Net.WebClient.
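A minimal sketch of that combination, assuming the HtmlAgilityPack package is referenced and the URL is whatever page you are scraping:
// fetch the raw HTML with WebClient, then hand it to HtmlAgilityPack
using (var client = new System.Net.WebClient())
{
    string html = client.DownloadString("http://www.stackoverflow.com");

    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);
    // doc.DocumentNode can now be queried with LINQ (Descendants, Elements, ...) or XPath
}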
You can use Html Agility Pack to load the HTML and find the element you need.
To get you started, you can fairly easily use HttpWebRequest to get the contents of a URL. From there, you will have to do something to parse out the HTML. That is where it starts to get tricky. You can't use a normal XML parser, because many (most?) website HTML pages aren't 100% valid XML. Web browsers have specially implemented parsers to work around the invalid portions. In Ruby, I would use something like Nokogiri to parse the HTML, so you might want to look for a .NET port of it, or another parser specifically designed to read HTML.
Edit:
Since the topic is likely to come up: WebClient vs. HttpWebRequest/HttpWebResponse
Also, thanks to the others that answered for noting HtmlAgility. I didn't know it existed.
Look into using the HTML Agility Pack, which is one of the more common libraries for parsing HTML.
http://htmlagilitypack.codeplex.com/

How to read HTML as XML?

I want to extract a couple of links from an HTML page downloaded from the internet; I think that using LINQ to XML would be a good solution for my case.
My problem is that I can't create an XmlDocument from the HTML; using Load(string url) didn't work, so I downloaded the HTML to a string using:
public static string readHTML(string url)
{
    HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
    HttpWebResponse res = (HttpWebResponse)req.GetResponse();
    StreamReader sr = new StreamReader(res.GetResponseStream());
    string html = sr.ReadToEnd();
    sr.Close();
    return html;
}
When I try to load that string using LoadXml(string xml) I get the exception
'--' is an unexpected token. The expected token is '>'
What way should I take to read the HTML file into parsable XML?
HTML simply isn't the same as XML (unless the HTML actually happens to be conforming XHTML or HTML5 in XML mode). The best way is to use an HTML parser to read the HTML. Afterwards you may transform it to LINQ to XML, or process it directly.
I haven't used it myself, but I suggest you take a look at SgmlReader. Here's a sample from their home page:
// setup SgmlReader ("reader" below is a TextReader over the HTML source, e.g. new StringReader(html))
Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader()
{
    DocType = "HTML",
    WhitespaceHandling = WhitespaceHandling.All,
    CaseFolding = Sgml.CaseFolding.ToLower,
    InputStream = reader
};

// create document
XmlDocument doc = new XmlDocument()
{
    PreserveWhitespace = true,
    XmlResolver = null
};
doc.Load(sgmlReader);
return doc;
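Since the question mentions LINQ to XML, a rough sketch of that last step, bridging the XmlDocument produced above into an XDocument and pulling out the anchor hrefs (standard System.Xml and System.Xml.Linq calls; element names are lower case because of CaseFolding.ToLower):
// minimal sketch: convert the XmlDocument ("doc" above) for LINQ to XML queries
// requires: using System.Linq; using System.Xml; using System.Xml.Linq;
XDocument xdoc;
using (var nodeReader = new XmlNodeReader(doc))
{
    xdoc = XDocument.Load(nodeReader);
}

// collect the href values of all <a> elements
var links = xdoc.Descendants("a")
                .Select(a => (string)a.Attribute("href"))
                .Where(href => !string.IsNullOrEmpty(href))
                .ToList();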
If you want to extract some links from a page, as you mentioned, try using HTML Agility Pack.
This code gets a page from the web and extracts all links:
HtmlWeb web = new HtmlWeb();
HtmlDocument document = web.Load("http://www.stackoverflow.com");
HtmlNode[] links = document.DocumentNode.SelectNodes("//a").ToArray();
Open an HTML file from disk and get the URL for a specific link:
HtmlDocument document2 = new HtmlDocument();
document2.Load(@"C:\Temp\page.html");
HtmlNode link = document2.DocumentNode.SelectSingleNode("//a[@id='myLink']");
Console.WriteLine(link.Attributes["href"].Value);
HTML is not XML. HTML is based on SGML, and as such does not ensure that the markup is well-formed XML (XML is a subset of SGML itself). You can only parse XHTML, i.e. XML-compatible HTML, as XML. But of course that is not the case for most websites.
To work with HTML, you need to use an HTML parser.
If you know the nodes you're interested in, I would use regex to extract the links from the string.
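A rough sketch of that regex route (brittle against unusual markup; the pattern is only a common approximation, not a full HTML attribute grammar):
// pull href values out of anchor tags in the html string returned by readHTML above
// requires: using System.Text.RegularExpressions;
var hrefRx = new Regex(@"<a[^>]+href\s*=\s*[""']?([^""' >]+)", RegexOptions.IgnoreCase);
foreach (Match m in hrefRx.Matches(html))
{
    Console.WriteLine(m.Groups[1].Value); // the captured URL
}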
