Extracting url links within a downloaded txt file - c#

Currently working on a URL extractor for work. I'm trying to extract all http/href links from a downloaded HTML file and print the links on their own in a separate txt file. So far I've managed to get the entire HTML of a page downloaded; it's just extracting the links from it and printing them using Regex that is the problem. Wondering if anyone could help me with this?
private void button2_Click(object sender, EventArgs e)
{
    Uri fileURI = new Uri(URLbox2.Text);

    WebRequest request = WebRequest.Create(fileURI);
    request.Credentials = CredentialCache.DefaultCredentials;
    WebResponse response = request.GetResponse();
    Console.WriteLine(((HttpWebResponse)response).StatusDescription);

    Stream dataStream = response.GetResponseStream();
    StreamReader reader = new StreamReader(dataStream);
    string responseFromServer = reader.ReadToEnd();

    SW = File.CreateText("C:\\Users\\Conal_Curran\\OneDrive\\C#\\MyProjects\\Web Crawler\\URLTester\\response1.htm");
    SW.WriteLine(responseFromServer);
    SW.Close();

    string text = System.IO.File.ReadAllText(@"C:\\Users\\Conal_Curran\\OneDrive\\C#\\MyProjects\\Web Crawler\\URLTester\\response1.htm");
    string[] links = System.IO.File.ReadAllLines(@"C:\\Users\\Conal_Curran\\OneDrive\\C#\\MyProjects\\Web Crawler\\URLTester\\response1.htm");

    Regex regx = new Regex(links, @"http://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\#\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?", RegexOptions.IgnoreCase);
    MatchCollection mactches = regx.Matches(text);

    foreach (Match match in mactches)
    {
        text = text.Replace(match.Value, "<a href='" + match.Value + "'>" + match.Value + "</a>");
    }

    SW = File.CreateText("C:\\Users\\Conal_Curran\\OneDrive\\C#\\MyProjects\\Web Crawler\\URLTester\\Links.htm");
    SW.WriteLine(links);
}

In case you do not know, this can be achieved (pretty easily) using one of the HTML parser NuGet packages available.
I personally use HtmlAgilityPack (along with ScrapySharp, another package) and AngleSharp.
With only the few lines below, you have all the hrefs in the document loaded by your HTTP GET request, using HtmlAgilityPack:
/*
do not forget to include the usings:
using System.Linq;
using HtmlAgilityPack;
using ScrapySharp.Extensions;
*/

//since you have your html locally stored, you do the following:
//P.S: by prefixing file path strings with @, you are rid of having to escape slashes and other fluff.
var doc = new HtmlDocument();
doc.Load(@"C:\Users\Conal_Curran\OneDrive\C#\MyProjects\Web Crawler\URLTester\response1.htm");

//for an http get request instead:
//var w = new HtmlWeb();
//var doc = w.Load("yourAddressHere");

var hrefs = doc.DocumentNode.CssSelect("a").Select(a => a.GetAttributeValue("href"));
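If you then want the links printed on their own in a separate text file, as the question asks, a minimal follow-up sketch (reusing the question's folder, with Links.txt as an example file name) is:
// assumes: using System.IO; hrefs comes from the snippet above
File.WriteAllLines(
    @"C:\Users\Conal_Curran\OneDrive\C#\MyProjects\Web Crawler\URLTester\Links.txt",
    hrefs);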

Related

How to download HTML web pages using C#

I need to download a number of web pages
with a pattern like
../directory/page_1.html
../directory/page_2.html
...
../directory/page_80.html
and save them into a folder. Can I do this simple task using C#? Please help. Any suggestions?
for (int i = 1; i < 81; i++)
{
    string url = "http://mywebsite.com/cats/page_" + Convert.ToString(i) + ".html";
    //
    // code to download the html page
    //
}
new WebClient().DownloadFile("http://abc.html", @"C:\downloadedFile_abc.html");
// Or you can get the file content without saving it
string htmlCode = new WebClient().DownloadString("http://abc.html");
for (int i = 1; i < 81; i++)
{
    string url = "http://mywebsite.com/cats/page_" + Convert.ToString(i) + ".html";
    using (WebClient client = new WebClient())
    {
        //you can download as a html file
        client.DownloadFile(url, @"C:\" + i.ToString() + ".html");
        // Or you can get html source code as a string
        string htmlCode = client.DownloadString(url);
    }
}
And if you want to get some data from the HTML code (such as a table that has the "xx" class name), you can use HtmlAgilityPack, for example as in the sketch below.
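For illustration, a minimal sketch, assuming the class name really is "xx" and reusing the htmlCode string downloaded above:
// assumes: using HtmlAgilityPack;
var doc = new HtmlDocument();
doc.LoadHtml(htmlCode);
// grab the first table whose class attribute is "xx" (placeholder class name)
var table = doc.DocumentNode.SelectSingleNode("//table[@class='xx']");
var rows = table == null ? null : table.SelectNodes(".//tr");
if (rows != null)
{
    foreach (var row in rows)
        Console.WriteLine(row.InnerText.Trim());
}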

HTML to PDF conversion using Aspose

I am new to Aspose, but I have successfully converted several file formats into PDFs. However, I am stuck with HTML to PDF conversion: I am able to convert an HTML file into a PDF successfully, but the CSS part is not rendering into the generated PDF. Any idea on this? I saved www.google.com as my input HTML file. Here is my controller code.
using Aspose.Pdf.Generator;

Pdf pdf = new Pdf();
pdf.HtmlInfo.CharSet = "UTF-8";
Section section = pdf.Sections.Add();
StreamReader r = File.OpenText(@"Local HTML File Path");
Text text2 = new Aspose.Pdf.Generator.Text(section, r.ReadToEnd());
pdf.HtmlInfo.ExternalResourcesBasePath = "Local HTML File Path";
text2.IsHtmlTagSupported = true;
text2.IsFitToPage = true;
section.Paragraphs.Add(text2);
pdf.Save(@"Generated PDF File Path");
Am I missing something? Any kind of help is greatly appreciated.
Thanks
My name is Tilal Ahmad and I am a developer evangelist at Aspose.
Please use the new DOM approach (Aspose.Pdf.Document) for HTML to PDF conversion. In this approach, to render external resources (CSS/images/fonts) you need to pass the resources path to the HtmlLoadOptions constructor. Please check the following documentation links for the purpose.
Convert HTML to PDF (new DOM)
HtmlLoadOptions options = new HtmlLoadOptions(resourcesPath);
Document pdfDocument = new Document(inputPath, options);
pdfDocument.Save("outputPath");
Convert Webpage to PDF (new DOM)
// Create a request for the URL.
WebRequest request = WebRequest.Create("https://en.wikipedia.org/wiki/Main_Page");
// If required by the server, set the credentials.
request.Credentials = CredentialCache.DefaultCredentials;
// Timeout in milliseconds before the request times out
// request.Timeout = 100;
// Get the response.
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
// Get the stream containing content returned by the server.
Stream dataStream = response.GetResponseStream();
// Open the stream using a StreamReader for easy access.
StreamReader reader = new StreamReader(dataStream);
// Read the content.
string responseFromServer = reader.ReadToEnd();
reader.Close();
dataStream.Close();
response.Close();

MemoryStream stream = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(responseFromServer));
HtmlLoadOptions options = new HtmlLoadOptions("https://en.wikipedia.org/wiki/");
options.PageInfo.IsLandscape = true;
// Load the HTML content
Document pdfDocument = new Document(stream, options);
// Save output as PDF format
pdfDocument.Save(outputPath);
Try using the media attribute in each style tag:
<style media="print">
and then provide the HTML file to your Aspose.Pdf Generator.
Try this. This is working nicely for me.
var pdfLicense = new Aspose.Pdf.License();
pdfLicense.SetLicense("Aspose.Pdf.lic");
var htmlLicense = new Aspose.Html.License();
htmlLicense.SetLicense("Aspose.Html.lic");

using (MemoryStream memoryStream = new MemoryStream())
{
    var options = new PdfRenderingOptions();
    using (PdfDevice pdfDevice = new PdfDevice(options, memoryStream))
    {
        using (var renderer = new HtmlRenderer())
        {
            using (HTMLDocument htmlDocument = new HTMLDocument(content, ""))
            {
                renderer.Render(pdfDevice, htmlDocument);
                //Save memoryStream into the output pdf file
            }
        }
    }
}
content is a string holding my HTML content.
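If you need the generated PDF on disk rather than only in memory, one minimal follow-up (the output path is just an example) is to flush the stream to a file right after the Render call, where the "Save memoryStream" comment sits:
// memoryStream now holds the rendered PDF bytes; the output path is only an example
File.WriteAllBytes(@"C:\temp\output.pdf", memoryStream.ToArray());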

How to properly get the content of a website?

I'm trying to read the content of the page and extract some information. But sometimes I get stuff like: &nbsp;Aur&eacute;lie (Verschuere)
I already do this:
string siteContent = "";
using (System.Net.WebClient client = new System.Net.WebClient())
{
client.Encoding = System.Text.Encoding.UTF8;
siteContent = client.DownloadString(edtReadFromUrl.Text);
}
It works when there are UTF-8 characters. Can't I just get readable text, with no HTML in it? That would be even easier.
Edit: it's not the same as the question someone marked it as a duplicate of. The other solution also returns strange characters.
You could use an HTML parser to extract the meaning. For instance, with HtmlAgilityPack, you could:
HtmlDocument doc = new HtmlDocument();
string html;
using (var wc = new WebClient())
{
    html = wc.DownloadString("http://www.bbc.co.uk/news");
}
doc.LoadHtml(html);
string text = doc.DocumentNode.Element("html").Element("body").InnerText;
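If the leftover problem is literal entities such as &nbsp; or &eacute; in that text, decoding them afterwards usually helps; a minimal sketch, assuming the InnerText above still contains encoded entities:
// assumes: using System.Net;
// turns entities like &eacute; and &nbsp; back into plain characters
string readableText = WebUtility.HtmlDecode(text);
Console.WriteLine(readableText);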

WebClient + HtmlAgilityPack web parsing

C# + WebClient + HtmlAgilityPack + web parsing
I wanted to go through the list of jobs on this page, but I can't parse those links because they change.
For example, when I see the link as it is in the browser (Link), and then parse it using WebClient and HtmlAgilityPack, I get a changed link.
Do I have to change settings on WebClient to include sessions or scripts?
Here is my code for that:
private void getLinks()
{
    StreamReader sr = new StreamReader("categories.txt");
    while (!sr.EndOfStream)
    {
        string url = sr.ReadLine();
        WebClient wc = new WebClient();
        string source = wc.DownloadString(url);
        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(source);
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(".//a[@class='internerLink primaerElement']");
        foreach (HtmlNode node in nodes)
        {
            Console.WriteLine("http://jobboerse.arbeitsagentur.de" + node.Attributes["href"].Value);
        }
    }
    sr.Close();
}
You may try the WebBrowser class (http://msdn.microsoft.com/en-us/library/system.windows.controls.webbrowser%28v=vs.110%29.aspx) and then use its DOM (see "Accessing DOM from WebBrowser") to retrieve the links.
mshtml.IHTMLDocument2 htmlDoc = webBrowser.Document as mshtml.IHTMLDocument2;
// do something like find button and click
htmlDoc.all.item("testBtn").click();
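To actually pull the links out of the rendered page rather than click a button, here is a rough sketch; it assumes a Windows Forms WebBrowser control named webBrowser (not the WPF/mshtml variant shown above) and that the page has finished loading, e.g. in its DocumentCompleted event:
// enumerate anchors through the WinForms DOM wrapper
foreach (System.Windows.Forms.HtmlElement anchor in webBrowser.Document.GetElementsByTagName("a"))
{
    // the WinForms wrapper uses the DOM property name "className" for the class attribute
    if (anchor.GetAttribute("className") == "internerLink primaerElement")
        Console.WriteLine(anchor.GetAttribute("href"));
}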

Web crawler time out

I am working on a simple web crawler that gets a URL, crawls the first-level links on the site, and extracts e-mail addresses from all pages using RegEx...
I know it's kinda sloppy and it's just the beginning, but I always get "operation timed out" after 2 minutes of running the script.
private void button1_Click(object sender, System.EventArgs e)
{
    string url = textBox1.Text;
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    HttpWebResponse response = (HttpWebResponse)request.GetResponse();
    StreamReader sr = new StreamReader(response.GetResponseStream());
    string code = sr.ReadToEnd();
    string re = "href=\"(.*?)\"";
    MatchCollection href = Regex.Matches(code, @re, RegexOptions.Singleline);
    foreach (Match h in href)
    {
        string link = h.Groups[1].Value;
        if (!link.Contains("http://"))
        {
            HttpWebRequest request2 = (HttpWebRequest)WebRequest.Create(url + link);
            HttpWebResponse response2 = (HttpWebResponse)request2.GetResponse();
            StreamReader sr2 = new StreamReader(response.GetResponseStream());
            string innerlink = sr.ReadToEnd();
            MatchCollection m2 = Regex.Matches(code, @"([\w-]+(\.[\w-]+)*@([a-z0-9-]+(\.[a-z0-9-]+)*?\.[a-z]{2,6}|(\d{1,3}\.){3}\d{1,3})(:\d{4})?)", RegexOptions.Singleline);
            foreach (Match m in m2)
            {
                string email = m.Groups[1].Value;
                if (!listBox1.Items.Contains(email))
                {
                    listBox1.Items.Add(email);
                }
            }
        }
    }
    sr.Close();
}
Don't parse HTML using Regex. Use the Html Agility Pack for that.
What exactly is the Html Agility Pack (HAP)?
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what System.Xml proposes, but for HTML documents (or streams).
More Information
Html Agility Pack on Codeplex
How to use HTML Agility pack
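As a rough illustration, the link-extraction part of the crawler could look like this with the Html Agility Pack instead of the href regex (url being the value read from textBox1 in the question):
// assumes: using HtmlAgilityPack;
var web = new HtmlWeb();
var doc = web.Load(url);
// select every anchor that actually carries an href attribute
var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
if (anchors != null)
{
    foreach (var a in anchors)
    {
        string link = a.GetAttributeValue("href", "");
        Console.WriteLine(link);
    }
}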
The comment by Oded is correct: we need to know what you need help with specifically. However, I can at least point you to the Html Agility Pack, as it will solve most of your web scraping woes.
Good luck!
Matt
