What I want is to open a link from a website (from its HTML content)
and get the HTML of the newly opened site.
Example: I have www.google.com, and I want to find all links on it.
For each link, I want to get the HTML content of the linked site.
I do something like this:
foreach (String link in GetLinksFromWebsite(htmlContent))
{
    using (var client = new WebClient())
    {
        htmlContent = client.DownloadString("http://" + link);
    }

    // istBildURL is assumed to be a MatchCollection produced by running an
    // image-URL regex against the freshly downloaded htmlContent
    foreach (Match treffer in istBildURL)
    {
        string bildUrl = treffer.Groups[1].Value;
        bildLinks.Add(bildUrl);
    }
}
public static List<String> GetLinksFromWebsite(string htmlSource)
{
    // NOTE: the markup inside the original pattern was stripped when the question
    // was posted; it presumably looked something like the anchor pattern below
    string linkPattern = "<a href=\"(.*?)\">(.*?)</a>";
    MatchCollection linkMatches = Regex.Matches(htmlSource, linkPattern, RegexOptions.Singleline);

    List<string> linkContents = new List<string>();
    foreach (Match match in linkMatches)
    {
        linkContents.Add(match.Value);
    }
    return linkContents;
}
The other problem is that I only get links, not LinkButtons (ASP.NET).
How can I solve this problem?
Steps to follow:
Download Html Agility Pack
Reference the assembly you have downloaded in your project
Throw out everything in your project that deals with parsing HTML and starts with the word regex or regular expression (read this answer to better understand why). In your case, that would be the contents of the GetLinksFromWebsite method.
Replace what you have thrown away with a simple call to the Html Agility Pack parser.
Here's an example:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using HtmlAgilityPack;
class Program
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            var htmlSource = client.DownloadString("http://www.stackoverflow.com");
            foreach (var item in GetLinksFromWebsite(htmlSource))
            {
                // TODO: you could easily write a recursive function
                // that will call itself here and retrieve the respective contents
                // of the site ...
                Console.WriteLine(item);
            }
        }
    }

    public static List<String> GetLinksFromWebsite(string htmlSource)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(htmlSource);
        return doc
            .DocumentNode
            .SelectNodes("//a[@href]")
            .Select(node => node.Attributes["href"].Value)
            .ToList();
    }
}
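To expand on the TODO comment above, here is a minimal sketch of what such a recursive download could look like, sitting alongside GetLinksFromWebsite in the same class. The maxDepth parameter and the Uri.TryCreate guard are my own additions (not part of the original answer) to keep the example from chasing relative or malformed links forever.

public static void CrawlRecursively(string url, int depth, int maxDepth)
{
    if (depth > maxDepth)
        return;

    using (var client = new WebClient())
    {
        var htmlSource = client.DownloadString(url);
        foreach (var link in GetLinksFromWebsite(htmlSource))
        {
            // Only follow absolute http(s) links; relative links would first
            // have to be resolved against the current url.
            if (Uri.TryCreate(link, UriKind.Absolute, out var absolute) &&
                (absolute.Scheme == Uri.UriSchemeHttp || absolute.Scheme == Uri.UriSchemeHttps))
            {
                Console.WriteLine(new string(' ', depth * 2) + absolute);
                CrawlRecursively(absolute.ToString(), depth + 1, maxDepth);
            }
        }
    }
}

Called as CrawlRecursively("http://www.stackoverflow.com", 0, 1), it prints the links of the start page and of every page it links to, one level deep.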
Related
I am writing a console application for web crawling and scraping in C#, just for learning purposes. When the result is displayed, some of the values are displayed along with HTML tags, in fact <strong> tags. I figured out the strong tags and replaced them completely. But what if there were many strong tags with different inline styling values?
How could I solve this problem?
Well, the problem is in the GetData() function:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Web;
using System.Net;
using System.IO;
using System.Text.RegularExpressions;
namespace MyCrawler
{
    public class Program
    {
        public static string GetContent(string url)
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
            WebResponse response = request.GetResponse();
            StreamReader reader = new StreamReader(response.GetResponseStream());
            string line = "";
            StringBuilder builder = new StringBuilder();
            while ((line = reader.ReadLine()) != null)
            {
                builder.Append(line.Trim());
            }
            reader.Close();
            return builder.ToString().Replace("\n", "");
        }

        public static void GetData(string content)
        {
            // these tags are to be replaced
            string ToBeReplaced1 = "<strong style=\"color:#F00\">";
            string ToBeReplaced2 = "</strong>";
            string ToBeReplaced3 = "<strong style=\"color:#000099\">";

            // pattern for regular expression
            string pattern3 = "<dt>(.*?)</dt><dd>(.*?)</dd>";
            Regex regex = new Regex(pattern3);
            MatchCollection mc = regex.Matches(content);
            foreach (Match m2 in mc)
            {
                Console.Write(m2.Groups[1].Value);
                Console.WriteLine(((m2.Groups[2].Value.Replace(ToBeReplaced3, "")).Replace(ToBeReplaced1, "")).Replace(ToBeReplaced2, ""));
            }
            Console.WriteLine();
        }

        public static void Main(string[] args)
        {
            string url = "http://www.merojob.com/";
            string content = GetContent(url);
            string pattern = "<div class=\"employername\"><h2>(.*?)</h2><a href=\"(.*?)\"";
            Regex regex = new Regex(pattern);
            MatchCollection mc = regex.Matches(content);
            foreach (Match m in mc)
            {
                foreach (Capture c in m.Groups[2].Captures)
                {
                    //Console.WriteLine(c.Value); // write the value to the console "pattern"
                    content = GetContent(c.Value);
                    GetData(content);
                }
            }
            Console.ReadKey();
        }
    }
}
Well, if I don't use the Replace() function, I end up with:
The best way in your case would be to use a dedicated library, such as HtmlAgilityPack, to retrieve specific tags and manipulate the structure of your DOM document. Doing it manually is a recipe for pain, and doing it with regular expressions may endanger your mind, so use a library to handle your HTML.
Even if this is for learning purposes only, you are not really using the right tool or exercise to start learning with, since this is a really complicated subject.
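As a rough illustration of that suggestion, here is a minimal sketch of how the <dt>/<dd> pairs handled by GetData() could be read with the Html Agility Pack instead of regex plus Replace(). InnerText drops all child markup, so it does not matter how many <strong> tags there are or what inline styles they carry. The XPath and node names simply mirror the question's regex pattern; the real markup of the target page may of course differ, and the class name below is just illustrative.

using System;
using HtmlAgilityPack;

public class StrongTagFreeScraper
{
    public static void GetData(string content)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(content);

        var dtNodes = doc.DocumentNode.SelectNodes("//dt");
        if (dtNodes == null) return; // nothing matched

        foreach (var dt in dtNodes)
        {
            // Assume the <dd> belonging to this <dt> is its next <dd> sibling.
            var dd = dt.SelectSingleNode("following-sibling::dd[1]");
            if (dd == null) continue;

            // InnerText strips all child tags such as <strong style="...">,
            // so no Replace() calls are needed.
            Console.Write(dt.InnerText.Trim());
            Console.WriteLine(dd.InnerText.Trim());
        }
        Console.WriteLine();
    }
}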
I need to get all the item links (URLs) from this webpage into a text file delimited by line breaks (in other words, a list like: "Item #1", "Item #2", etc.).
http://dota-trade.com/equipment?order=name is the webpage, and if you scroll down it goes on and on to about ~500-1000 items.
What programming language would I have to use, or how would I be able to do this? I also have experience using iMacros.
You need to download HtmlAgilityPack
using HtmlAgilityPack;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text;
using System.Threading.Tasks;
namespace ConsoleApplication5
{
    class Program
    {
        static void Main(string[] args)
        {
            WebClient wc = new WebClient();
            var sourceCode = wc.DownloadString("http://dota-trade.com/equipment?order=name");
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(sourceCode);
            var node = doc.DocumentNode;
            var nodes = node.SelectNodes("//a");

            List<string> links = new List<string>();
            foreach (var item in nodes)
            {
                var link = item.Attributes["href"].Value;
                links.Add(link.Contains("http") ? link : "http://dota-trade.com" + link);
            }

            int index = 1;
            while (true)
            {
                sourceCode = wc.DownloadString("http://dota-trade.com/equipment?order=name&offset=" + index.ToString());
                doc = new HtmlDocument();
                doc.LoadHtml(sourceCode);
                node = doc.DocumentNode;
                nodes = node.SelectNodes("//a");

                var cont = node.SelectSingleNode("//tr[@itemtype='http://schema.org/Thing']");
                if (cont == null) break;

                foreach (var item in nodes)
                {
                    var link = item.Attributes["href"].Value;
                    links.Add(link.Contains("http") ? link : "http://dota-trade.com" + link);
                }
                index++;
            }

            System.IO.File.WriteAllLines(@"C:\Users\Public\WriteLines.txt", links);
        }
    }
}
I would recommend using any language with regular expression support. I use Ruby a lot, so I would do something like this:
require 'net/http'
require 'uri'

uri = URI.parse("http://dota-trade.com/equipment?order=name")
req = Net::HTTP::Get.new(uri.path)
http = Net::HTTP.new(uri.host, uri.port)
response = http.request(req)
links = response.body.match(/<a.+?href="(.+?)"/)
This is off the top of my head, but links should be a match object: links[0] is the full match, and every element thereafter is a captured group.
puts links[1..-1].join("\n")
The last line should dump what you want but likely won't include the host. If you want the host included do something like this:
puts links[1..-1].map{|l| "http://dota-trade.com" + l }.join("\n")
Please keep in mind this is untested.
I am using GetSafeHtmlFragment in my website, and I found that all tags except <p> and <a> are removed.
I researched around and found that there is no resolution for it from Microsoft.
Is there any replacement for it, or is there any other solution?
Thanks.
It is amazing that Microsoft, in version 4.2.1, terribly overcompensated for a security leak in the 4.2 XSS library and still hasn't updated it a year later. The GetSafeHtmlFragment method should have been renamed StripHtml, as I read someone comment somewhere.
I ended up using the HtmlSanitizer library suggested in this related SO issue. I liked that it was available as a package through NuGet.
This library basically implements a variation of the white-list approach the now-accepted answer uses, but it is based on CsQuery instead of the Html Agility Pack. The package also offers some additional options, such as being able to keep style information (e.g. HTML attributes). Using this library resulted in code in my project something like the below, which - at least - is a lot less code than the accepted answer :).
using Html;
...
var sanitizer = new HtmlSanitizer();
sanitizer.AllowedTags = new List<string> { "p", "ul", "li", "ol", "br" };
string sanitizedHtml = sanitizer.Sanitize(htmlString);
An alternative solution would be to use the Html Agility Pack in conjunction with your own tag white list:
using System;
using System.IO;
using System.Text;
using System.Linq;
using System.Collections.Generic;
using HtmlAgilityPack;
class Program
{
    static void Main(string[] args)
    {
        var whiteList = new[]
        {
            "#comment", "html", "head",
            "title", "body", "img", "p",
            "a"
        };
        var html = File.ReadAllText("input.html");
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var nodesToRemove = new List<HtmlAgilityPack.HtmlNode>();
        var e = doc
            .CreateNavigator()
            .SelectDescendants(System.Xml.XPath.XPathNodeType.All, false)
            .GetEnumerator();
        while (e.MoveNext())
        {
            var node =
                ((HtmlAgilityPack.HtmlNodeNavigator)e.Current)
                .CurrentNode;
            if (!whiteList.Contains(node.Name))
            {
                nodesToRemove.Add(node);
            }
        }
        nodesToRemove.ForEach(node => node.Remove());

        var sb = new StringBuilder();
        using (var w = new StringWriter(sb))
        {
            doc.Save(w);
        }
        Console.WriteLine(sb.ToString());
    }
}
Has anyone created an open source C# parser for the Web Links HTTP "Link" header?
See: https://www.rfc-editor.org/rfc/rfc5988.
Example:
Link: <http://example.com/TheBook/chapter2>; rel="previous"; title="previous chapter"
Thanks.
Update: Ended up creating my own parser: https://github.com/JornWildt/Ramone/blob/master/Ramone/Utility/WebLinkParser.cs. Feel free to use it.
Here's an extension method I've used:
public static Dictionary<string, string> ParseLinksHeader(
    this HttpResponseMessage response)
{
    var links = new Dictionary<string, string>();
    response.Headers.TryGetValues("link", out var headers);
    if (headers == null) return links;

    var matches = Regex.Matches(
        headers.First(),
        @"<(?<url>[^>]*)>;\s+rel=""(?<link>\w+)""");
    foreach (Match m in matches)
        links.Add(m.Groups["link"].Value, m.Groups["url"].Value);

    return links;
}
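A quick usage sketch, assuming the extension method lives in a static class that is in scope and that this code runs inside an async method; the GitHub URL and the "next" relation are purely illustrative:

using (var client = new HttpClient())
{
    var response = await client.GetAsync("https://api.github.com/repos/dotnet/runtime/issues");

    // links then maps each rel value (e.g. "next", "last") to the URL taken
    // from the Link header, or stays empty if the header is absent.
    Dictionary<string, string> links = response.ParseLinksHeader();

    if (links.TryGetValue("next", out var nextPage))
        Console.WriteLine("Next page: " + nextPage);
}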
Take the HTML Agility Pack and use the right SelectNodes query.
using System;
using HtmlAgilityPack;

namespace WebScraper
{
    class Program
    {
        static void Main(string[] args)
        {
            // url is assumed to hold the address of the page you want to scrape
            string url = "http://www.example.com";

            HtmlWeb web = new HtmlWeb();
            HtmlDocument doc = web.Load(url);
            foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
            {
                // e.g. print the target of each anchor
                Console.WriteLine(link.GetAttributeValue("href", string.Empty));
            }
        }
    }
}
I can't figure out what goes wrong. I just created a project to test the HtmlAgilityPack, and here is what I've got:
using System;
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack;
namespace parseHabra
{
    class Program
    {
        static void Main(string[] args)
        {
            HTTP net = new HTTP(); // some http wrapper
            string result = net.MakeRequest("http://stackoverflow.com/", null);

            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(result);

            // Get all summary blocks
            HtmlNodeCollection news = doc.DocumentNode.SelectNodes("//div[@class=\"summary\"]");
            foreach (HtmlNode item in news)
            {
                string title = String.Empty;
                // trouble is here: for each element item I get the same value
                // all the time
                title = item.SelectSingleNode("//a[@class=\"question-hyperlink\"]").InnerText.Trim();
                Console.WriteLine(title);
            }
            Console.ReadLine();
        }
    }
}
It looks like the XPath is applied not to each node I've selected but to the whole document. Any suggestions why that is so? Thanks in advance.
I have not tried your code, but from a quick look I suspect the problem is that the // is searching from the root of the entire document and not from the current element, as I guess you are expecting.
Try putting a . before the //:
".//a[@class=\"question-hyperlink\"]"
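Applied to the loop from the question, that would look roughly like this; only the leading dot in the XPath changes, everything else stays as in the original code:

foreach (HtmlNode item in news)
{
    // The leading "." anchors the query at the current summary node,
    // so each iteration returns that summary's own title link.
    string title = item
        .SelectSingleNode(".//a[@class=\"question-hyperlink\"]")
        .InnerText
        .Trim();
    Console.WriteLine(title);
}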
I'd rewrite your XPath as a single query that finds all the question titles, rather than finding the summaries and then the titles. Chris' answer points out the problem, which could easily have been avoided.
var web = new HtmlWeb();
var doc = web.Load("http://stackoverflow.com");

var xpath = "//div[starts-with(@id,'question-summary-')]//a[@class='question-hyperlink']";
var questionTitles = doc.DocumentNode
    .SelectNodes(xpath)
    .Select(a => a.InnerText.Trim());