I need to download a number of web pages
with a pattern like
../directory/page_1.html
../directory/page_2.html
...
../directory/page_80.html
and save them into a folder. Can I do this simple task using C#? Any suggestions would be appreciated.
for (int i = 1; i < 81; i++)
{
    string url = "http://mywebsite.com/cats/page_" + Convert.ToString(i) + ".html";
    //
    // code to download the html page
    //
}
new WebClient().DownloadFile("http://abc.html", @"C:\downloadedFile_abc.html");
// Or you can get the file content without saving it:
string htmlCode = new WebClient().DownloadString("http://abc.html");
for (int i = 1; i < 81; i++)
{
    string url = "http://mywebsite.com/cats/page_" + Convert.ToString(i) + ".html";
    using (WebClient client = new WebClient())
    {
        // you can download it as an html file
        client.DownloadFile(url, @"C:\" + i.ToString() + ".html");
        // or you can get the html source code as a string
        string htmlCode = client.DownloadString(url);
    }
}
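A side note: writing straight to the root of C:\ usually needs elevated permissions, so if the goal is to save the pages into a folder, you can build the target path with Path.Combine. A minimal sketch, assuming a hypothetical downloads folder (needs using System.IO and System.Net):
string folder = @"C:\Users\YourName\Downloads\cats"; // hypothetical folder, adjust as needed
Directory.CreateDirectory(folder); // does nothing if the folder already exists

for (int i = 1; i < 81; i++)
{
    string url = "http://mywebsite.com/cats/page_" + i + ".html";
    using (WebClient client = new WebClient())
    {
        client.DownloadFile(url, Path.Combine(folder, "page_" + i + ".html"));
    }
}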
And if you want to extract some data from the html code (such as a table that has the "xx" class name), you can use HtmlAgilityPack.
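For example, a minimal sketch with HtmlAgilityPack (assuming the NuGet package is installed and reusing the hypothetical "xx" class name):
// using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlCode); // htmlCode from DownloadString above

// select every <table> whose class attribute is "xx"
var tables = doc.DocumentNode.SelectNodes("//table[@class='xx']");
if (tables != null)
{
    foreach (HtmlNode table in tables)
    {
        Console.WriteLine(table.InnerText.Trim());
    }
}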
Related
I want to read the view source of any web page and download all of its images into a folder.
I use the code below to read the page source:
string address = "http://stackoverflow.com/"; //any web site url
using (WebClient wc = new WebClient())
{
    txtRead.Text = wc.DownloadString(address);
}
But from this view source, how do I get only the img src values and download the images into a local folder?
Thanks,
Hitesh
If you use HtmlAgilityPack, you can do something like this:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(yourHtml);
foreach (HtmlNode image in doc.DocumentNode.SelectNodes("//img"))
{
    HtmlAttribute src = image.Attributes["src"];
    CodeToDownloadTheImage(src);
}
See this question for more info.
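CodeToDownloadTheImage above is just a placeholder; a minimal sketch of what it could look like, assuming the page address is known so relative src values can be resolved, and that the target folder already exists:
// using System;
// using System.IO;
// using System.Net;
static void CodeToDownloadTheImage(HtmlAttribute src)
{
    // resolve relative paths such as "/images/logo.png" against the page address (hypothetical base URL)
    Uri baseUri = new Uri("http://stackoverflow.com/");
    Uri imageUri = new Uri(baseUri, src.Value);

    // use the last path segment as the local file name
    string fileName = Path.GetFileName(imageUri.LocalPath);

    using (WebClient wc = new WebClient())
    {
        wc.DownloadFile(imageUri, Path.Combine(@"C:\images", fileName)); // hypothetical target folder
    }
}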
I'm currently working on a URL extractor for work. I'm trying to extract all http/href links from a downloaded html file and print the links on their own in a separate txt file. So far I've managed to get the entire html of a page downloaded; it's just extracting the links from it and printing them using Regex that is the problem. I'm wondering if anyone could help me with this?
private void button2_Click(object sender, EventArgs e)
{
    Uri fileURI = new Uri(URLbox2.Text);

    WebRequest request = WebRequest.Create(fileURI);
    request.Credentials = CredentialCache.DefaultCredentials;
    WebResponse response = request.GetResponse();
    Console.WriteLine(((HttpWebResponse)response).StatusDescription);

    Stream dataStream = response.GetResponseStream();
    StreamReader reader = new StreamReader(dataStream);
    string responseFromServer = reader.ReadToEnd();

    SW = File.CreateText("C:\\Users\\Conal_Curran\\OneDrive\\C#\\MyProjects\\Web Crawler\\URLTester\\response1.htm");
    SW.WriteLine(responseFromServer);
    SW.Close();

    string text = System.IO.File.ReadAllText("C:\\Users\\Conal_Curran\\OneDrive\\C#\\MyProjects\\Web Crawler\\URLTester\\response1.htm");
    string[] links = System.IO.File.ReadAllLines("C:\\Users\\Conal_Curran\\OneDrive\\C#\\MyProjects\\Web Crawler\\URLTester\\response1.htm");

    Regex regx = new Regex(links, "http://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\#\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?", RegexOptions.IgnoreCase);
    MatchCollection matches = regx.Matches(text);

    foreach (Match match in matches)
    {
        text = text.Replace(match.Value, "<a href='" + match.Value + "'>" + match.Value + "</a>");
    }

    SW = File.CreateText("C:\\Users\\Conal_Curran\\OneDrive\\C#\\MyProjects\\Web Crawler\\URLTester\\Links.htm");
    SW.WriteLine(links);
}
In case you do not know, this can be achieved (pretty easily) using one of the html parser nuget packages available.
I personally use HtmlAgilityPack (along with ScrapySharp, another package) and AngleSharp.
With only a few lines, you have all the hrefs in the document loaded by your http get request, using HtmlAgilityPack:
/*
do not forget to include the usings:
using System.Linq;
using HtmlAgilityPack;
using ScrapySharp.Extensions;
*/
HtmlWeb w = new HtmlWeb();

// since you have your html locally stored, you do the following:
// P.S: by prefixing file path strings with @, you are rid of having to escape slashes and other fluff.
var doc = new HtmlDocument();
doc.Load(@"C:\Users\Conal_Curran\OneDrive\C#\MyProjects\Web Crawler\URLTester\response1.htm");

// for an http get request instead:
// var doc = w.Load("yourAddressHere");

var hrefs = doc.DocumentNode.CssSelect("a").Select(a => a.GetAttributeValue("href", string.Empty));
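And since the goal was to print the links on their own in a separate txt file, a short follow-up sketch (the output path is just an example, and File.WriteAllLines needs using System.IO):
File.WriteAllLines(
    @"C:\Users\Conal_Curran\OneDrive\C#\MyProjects\Web Crawler\URLTester\Links.txt", // example output path
    hrefs);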
I can't read the xml string from http://158.58.185.214/Applications/Operator/Files/Data/Bus/CityList.xml, and I think the encoding is the problem. Please help me solve it.
My code is:
string url = "http://158.58.185.214/Applications/Operator/Files/Data/Bus/CityList.xml";
WebClient client = new WebClient();
string xml = client.DownloadString(url);
But the xml string comes back as:
‹ í½`I–%&/mÊ{JõJ×àt¡€`$Ø#ìÁˆÍæ’ìiG#....
Your problem can be solved like this:
using System.Xml;

String URLString = "http://localhost/books.xml";
XmlTextReader reader = new XmlTextReader(URLString);
while (reader.Read())
{
    // Do some work here on the data.
    Console.WriteLine(reader.Name);
}
Console.ReadLine();
Refer to this: https://support.microsoft.com/kb/307643/en-us
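Applied to the URL from the question, a minimal sketch could look like the following; whether it fixes the garbled output depends on how that server actually encodes (or compresses) the response, so treat this as an assumption rather than a guaranteed fix:
// using System;
// using System.Xml;
string url = "http://158.58.185.214/Applications/Operator/Files/Data/Bus/CityList.xml";

using (XmlReader reader = XmlReader.Create(url))
{
    while (reader.Read())
    {
        if (reader.NodeType == XmlNodeType.Element)
        {
            Console.WriteLine(reader.Name); // e.g. print each element name
        }
    }
}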
C# + WebClient + HtmlAgilityPack + web parsing
I wanted to go through the list of jobs on this page, but I can't parse those links because they change.
For example, I can see the link as it is in the browser (Link),
but when I parse it using WebClient and HtmlAgilityPack I get the changed link.
Do I have to configure anything on WebClient to include sessions or scripts?
Here is my code for that:
private void getLinks()
{
    StreamReader sr = new StreamReader("categories.txt");
    while (!sr.EndOfStream)
    {
        string url = sr.ReadLine();
        WebClient wc = new WebClient();
        string source = wc.DownloadString(url);

        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(source);

        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(".//a[@class='internerLink primaerElement']");
        foreach (HtmlNode node in nodes)
        {
            Console.WriteLine("http://jobboerse.arbeitsagentur.de" + node.Attributes["href"].Value);
        }
    }
    sr.Close();
}
You may try the WebBrowser class (http://msdn.microsoft.com/en-us/library/system.windows.controls.webbrowser%28v=vs.110%29.aspx) and then use its DOM (see Accessing DOM from WebBrowser) to retrieve the links.
mshtml.IHTMLDocument2 htmlDoc = webBrowser.Document as mshtml.IHTMLDocument2;
// do something like find button and click
htmlDoc.all.item("testBtn").click();
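Since the goal here is links rather than buttons, a minimal sketch of reading the anchors out of the WebBrowser DOM after the page (and its scripts) have finished loading; webBrowser is assumed to be the same control as above:
// run this once the document has finished loading (e.g. from a LoadCompleted/DocumentCompleted handler)
mshtml.IHTMLDocument2 htmlDoc = webBrowser.Document as mshtml.IHTMLDocument2;
if (htmlDoc != null)
{
    foreach (mshtml.IHTMLAnchorElement anchor in htmlDoc.links)
    {
        Console.WriteLine(anchor.href);
    }
}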
Given a web URL, I want to detect all the links in a website, identify the internal links, and list them.
What I have is this:
WebClient webClient = null;
webClient = new WebClient();
string strUrl = "http://www.anysite.com";
string completeHTMLCode = "";
try
{
    completeHTMLCode = webClient.DownloadString(strUrl);
}
catch (Exception)
{
}
Using this I can read the contents of the page... but the only idea I have in mind is parsing this string: searching for <a, then href, then the value between the double quotes.
Is this the only way? Or is there some other, better solution?
Use the HTML Agility Pack. Here's a link to a blog post to get you started. Do not use Regex.
using HtmlAgilityPack;

string completeHTMLCode = webClient.DownloadString(strUrl);

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(completeHTMLCode);

foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    // work with link.Attributes["href"].Value here
}
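To separate the internal links (the part the Agility Pack won't decide for you), one approach is to resolve each href against the page's base URI and compare hosts. A minimal sketch, assuming strUrl and doc from the snippets above (needs using System and using System.Linq):
Uri baseUri = new Uri(strUrl);

var internalLinks = doc.DocumentNode
    .Descendants("a")
    .Where(a => a.Attributes["href"] != null)
    .Select(a => new Uri(baseUri, a.Attributes["href"].Value))   // resolves relative links like "/about"
    .Where(u => (u.Scheme == Uri.UriSchemeHttp || u.Scheme == Uri.UriSchemeHttps)
                && u.Host == baseUri.Host)                        // same host => internal link
    .Select(u => u.ToString())
    .Distinct()
    .ToList();

foreach (string link in internalLinks)
{
    Console.WriteLine(link);
}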