How to download HTML web pages using C#

I need to download a number of web pages
with a pattern like
../directory/page_1.html
../directory/page_2.html
...
../directory/page_80.html
and save them into a folder. Can I do this simple task using C#? Any suggestions would be appreciated.
for (int i = 1; i < 81; i++)
{
    string url = "http://mywebsite.com/cats/page_" + Convert.ToString(i) + ".html";
    //
    // code to download the html page
    //
}

new WebClient().DownloadFile("http://abc.html", @"C:\downloadedFile_abc.html");
// Or you can get the file content without saving it
string htmlCode = new WebClient().DownloadString("http://abc.html");

for (int i = 1; i < 81; i++)
{
    string url = "http://mywebsite.com/cats/page_" + Convert.ToString(i) + ".html";
    using (WebClient client = new WebClient())
    {
        // you can download it as an html file
        client.DownloadFile(url, @"C:\" + i.ToString() + ".html");
        // Or you can get the html source code as a string
        string htmlCode = client.DownloadString(url);
    }
}
And if you want to extract some data from the HTML (such as a table with a given class name, e.g. "xx"), you can use HtmlAgilityPack.
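For example, a minimal sketch of that with HtmlAgilityPack (the URL and the "xx" class name are placeholders carried over from above, not a real site):

using System;
using System.Net;
using HtmlAgilityPack;

// Sketch: download one page and read the rows of the table whose class is "xx".
using (WebClient client = new WebClient())
{
    string htmlCode = client.DownloadString("http://mywebsite.com/cats/page_1.html");

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(htmlCode);

    HtmlNode table = doc.DocumentNode.SelectSingleNode("//table[@class='xx']");
    if (table != null)
    {
        var rows = table.SelectNodes(".//tr");
        if (rows != null)
        {
            foreach (HtmlNode row in rows)
            {
                Console.WriteLine(row.InnerText.Trim());
            }
        }
    }
}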

Related

Read the view source of any web page URL and download its images to a local folder in C#

I want to read the view source of any web page and download all of its images into a folder.
I use the code below to read the page source:
string address = "http://stackoverflow.com/"; //any web site url
using (WebClient wc = new WebClient())
{
txtRead.Text = wc.DownloadString(address);
}
But from this view source, how do I get only the img src values and download the images into a local folder?
Thanks,
Hitesh
If you use HtmlAgilityPack, you can do something like this:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(yourHtml);
foreach (HtmlNode image in doc.DocumentNode.SelectNodes("//img"))
{
    HtmlAttribute src = image.Attributes["src"];
    CodeToDownloadTheImage(src);
}
See this question for more info.
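If it helps, here is a minimal sketch of the download step built on the same idea, assuming WebClient and a local target folder (the folder path and the src resolution are illustrative, not part of the original answer):

using System;
using System.IO;
using System.Net;
using HtmlAgilityPack;

// Sketch: download every <img> on the page into a local folder.
// The target folder is an assumption; relative src values are resolved against the page address.
string address = "http://stackoverflow.com/";
string targetFolder = @"C:\DownloadedImages";
Directory.CreateDirectory(targetFolder);

using (WebClient wc = new WebClient())
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(wc.DownloadString(address));

    var images = doc.DocumentNode.SelectNodes("//img[@src]");
    if (images != null)
    {
        foreach (HtmlNode image in images)
        {
            // Resolve relative src values against the page address
            Uri imageUri = new Uri(new Uri(address), image.GetAttributeValue("src", ""));
            string fileName = Path.GetFileName(imageUri.LocalPath);
            if (!string.IsNullOrEmpty(fileName))
                wc.DownloadFile(imageUri, Path.Combine(targetFolder, fileName));
        }
    }
}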

Extracting url links within a downloaded txt file

Currently working on a URL extractor for work. I'm trying to extract all http/href links from a downloaded HTML file and print the links on their own in a separate txt file. So far I've managed to get the entire HTML of a page downloaded; it's just extracting the links from it and printing them using Regex that is the problem. Wondering if anyone could help me with this?
private void button2_Click(object sender, EventArgs e)
{
    Uri fileURI = new Uri(URLbox2.Text);

    WebRequest request = WebRequest.Create(fileURI);
    request.Credentials = CredentialCache.DefaultCredentials;
    WebResponse response = request.GetResponse();
    Console.WriteLine(((HttpWebResponse)response).StatusDescription);

    Stream dataStream = response.GetResponseStream();
    StreamReader reader = new StreamReader(dataStream);
    string responseFromServer = reader.ReadToEnd();

    SW = File.CreateText(@"C:\Users\Conal_Curran\OneDrive\C#\MyProjects\Web Crawler\URLTester\response1.htm");
    SW.WriteLine(responseFromServer);
    SW.Close();

    string text = System.IO.File.ReadAllText(@"C:\Users\Conal_Curran\OneDrive\C#\MyProjects\Web Crawler\URLTester\response1.htm");
    string[] links = System.IO.File.ReadAllLines(@"C:\Users\Conal_Curran\OneDrive\C#\MyProjects\Web Crawler\URLTester\response1.htm");

    Regex regx = new Regex(@"http://([\w+?\.\w+])+([a-zA-Z0-9\~\!\@\#\$\%\^\&\*\(\)_\-\=\+\\\/\?\.\:\;\'\,]*)?", RegexOptions.IgnoreCase);
    MatchCollection matches = regx.Matches(text);

    foreach (Match match in matches)
    {
        text = text.Replace(match.Value, "<a href='" + match.Value + "'>" + match.Value + "</a>");
    }

    SW = File.CreateText(@"C:\Users\Conal_Curran\OneDrive\C#\MyProjects\Web Crawler\URLTester\Links.htm");
    SW.WriteLine(links);
}
In case you do not know, this can be achieved (pretty easily) using one of the HTML parser NuGet packages available.
I personally use HtmlAgilityPack (along with ScrapySharp, another package) and AngleSharp.
With just a few lines below, you have all the hrefs in the document loaded by your HTTP GET request, using HtmlAgilityPack:
/*
do not forget to include the usings:
using System.Linq;
using HtmlAgilityPack;
using ScrapySharp.Extensions;
*/
HtmlWeb w = new HtmlWeb();
//since you have your html stored locally, load it from the file
//P.S: by prefixing file path strings with @, you are rid of having to escape slashes and other fluff.
var doc = new HtmlDocument();
doc.Load(@"C:\Users\Conal_Curran\OneDrive\C#\MyProjects\Web Crawler\URLTester\response1.htm");
//for an http get request instead:
//var doc = w.Load("yourAddressHere");
var hrefs = doc.DocumentNode.CssSelect("a").Select(a => a.GetAttributeValue("href", string.Empty));
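To finish the original goal of printing the links to their own file, a possible follow-up using the hrefs variable from above (the output path is only an example):

// Write each extracted href on its own line.
System.IO.File.WriteAllLines(
    @"C:\Users\Conal_Curran\OneDrive\C#\MyProjects\Web Crawler\URLTester\Links.txt",
    hrefs);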

How to read xml from url C#

I can't read the XML string from http://158.58.185.214/Applications/Operator/Files/Data/Bus/CityList.xml and I think the encoding is the problem. Please help me solve it.
My code is:
string url = "http://158.58.185.214/Applications/Operator/Files/Data/Bus/CityList.xml";
WebClient client = new WebClient();
string xml = client.DownloadString(url);
but the xml string is:
‹ í½`I–%&/mÊ{JõJ×àt¡€`$Ø#ìÁˆÍæ’ìiG#....
Your problem can be solved like this:
using System.Xml;
String URLString = "http://localhost/books.xml";
XmlTextReader reader = new XmlTextReader(URLString);
while (reader.Read())
{
// Do some work here on the data.
Console.WriteLine(reader.Name);
}
Console.ReadLine();
Refer to this: https://support.microsoft.com/kb/307643/en-us
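If the garbled text is really a compressed response rather than an encoding issue (the leading bytes look like a gzip stream, though that is an assumption, not a confirmed diagnosis), enabling automatic decompression before reading the XML may also help:

using System;
using System.Net;
using System.Xml;

// Sketch: request the XML with automatic gzip/deflate decompression enabled.
string url = "http://158.58.185.214/Applications/Operator/Files/Data/Bus/CityList.xml";
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;

using (WebResponse response = request.GetResponse())
using (XmlReader reader = XmlReader.Create(response.GetResponseStream()))
{
    while (reader.Read())
    {
        // Do some work here on the data.
        Console.WriteLine(reader.Name);
    }
}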

WebClient + HtmlAgilityPack web parsing

C# + WebClient + HtmlAgilityPack + web parsing.
I wanted to go through the list of jobs on this page, but I can't parse those links because they change.
For example, I see the link as it is in the browser (Link),
but when I parse it using WebClient and HtmlAgilityPack I get a changed link.
Do I have to change settings on WebClient to include sessions or scripts?
Here is my code for that:
private void getLinks()
{
StreamReader sr = new StreamReader("categories.txt");
while(!sr.EndOfStream)
{
string url = sr.ReadLine();
WebClient wc = new WebClient();
string source = wc.DownloadString(url);
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(source);
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(".//a[@class='internerLink primaerElement']");
foreach (HtmlNode node in nodes)
{
Console.WriteLine("http://jobboerse.arbeitsagentur.de" + node.Attributes["href"].Value);
}
}
sr.Close();
}
You may try the WebBrowser class (http://msdn.microsoft.com/en-us/library/system.windows.controls.webbrowser%28v=vs.110%29.aspx) and then use its DOM (see Accessing DOM from WebBrowser) to retrieve the links.
mshtml.IHTMLDocument2 htmlDoc = webBrowser.Document as mshtml.IHTMLDocument2;
// do something like find button and click
htmlDoc.all.item("testBtn").click();
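Since the question asks for links rather than buttons, a related sketch, assuming the same mshtml interop reference and that the page has finished loading, could enumerate the document's anchor collection:

mshtml.IHTMLDocument2 htmlDoc = webBrowser.Document as mshtml.IHTMLDocument2;
if (htmlDoc != null)
{
    // htmlDoc.links contains the page's <a href> and <area> elements
    foreach (mshtml.IHTMLElement element in htmlDoc.links)
    {
        var anchor = element as mshtml.IHTMLAnchorElement;
        if (anchor != null)
        {
            Console.WriteLine(anchor.href);
        }
    }
}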

Getting all the anchor tags of a web page

Given a web URL, I want to detect all the links in a website, identify the internal links, and list them.
What I have is this:
WebClient webClient = null;
webClient = new WebClient();
string strUrl = "http://www.anysite.com";
string completeHTMLCode = "";
try
{
completeHTMLCode = webClient.DownloadString(strUrl);
}
catch (Exception)
{
}
Using this I can read the contents of the page... but the only idea I have in mind is parsing this string: searching for <a, then href, then the value between the double quotes.
Is this the only way out? Or there lies some other better solution(s)?
Use the HTML Agility Pack. Here's a link to a blog post to get you started. Do not use Regex.
using HtmlAgilityPack;

completeHTMLCode = webClient.DownloadString(strUrl);

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(completeHTMLCode);

foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    // work with link.Attributes["href"].Value here
}
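And because the asker also wants to identify the internal links, a hedged follow-up on top of the loop above, assuming the doc and strUrl variables from the snippets in this question and treating "internal" as "same host":

// Keep only links whose host matches the site being crawled.
Uri baseUri = new Uri(strUrl);
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    Uri target;
    if (Uri.TryCreate(baseUri, link.GetAttributeValue("href", ""), out target)
        && target.Host == baseUri.Host)
    {
        Console.WriteLine(target.AbsoluteUri);
    }
}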
