I want to download a lot of pictures from a specific website, but the pictures have different URLs (I mean they are not sequential like something.com/picture1, something.com/picture2, and so on). If it helps, I want to download from EA's FUT card database, but I have no idea how to do this.
You can use the HTML Agility Pack to parse every <img> tag from the response and read its src attribute.
Then you can loop through the image tags and download each image via HttpClient, the same way you downloaded the webpage.
This would look something like this (response is the HTML returned by the web request):
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(response);

// Select every <img> element and read its src attribute
foreach (HtmlNode img in doc.DocumentNode.SelectNodes("//img"))
{
    string src = img.GetAttributeValue("src", null);
    // Use src to download the picture here
}
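For the download step itself, a minimal sketch with HttpClient inside an async method could look like this (the page address used to resolve relative paths and the file naming are assumptions, and src is the attribute value read in the loop above):

using (var client = new HttpClient())
{
    // Resolve a possibly relative src against the page's address (placeholder URL)
    var imageUri = new Uri(new Uri("https://www.example.com/page"), src);
    byte[] bytes = await client.GetByteArrayAsync(imageUri);
    File.WriteAllBytes(Path.GetFileName(imageUri.LocalPath), bytes);
}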
You can find more info about the Html Agility Pack here:
http://html-agility-pack.net/
Related
I have a list of websites that has been generated and stored in a text file. Now I'm trying to load that file so I can repeat the process of extracting website URLs.
Every time I run the application, HtmlAgilityPack.HtmlDocument is the only thing printed in the console window.
private static async void GetHtmlAsync1()
{
    var doc = new HtmlDocument();
    doc.Load(FilenameHere);
    Console.WriteLine(doc);
}
Am I approaching this right?
Thanks
This is an example of loading a text file full of URLs and reading their content. My test file is in the same location as my project files.
List<string> allUrls = File.ReadAllLines($#"{Directory.GetParent(Environment.CurrentDirectory).Parent.Parent.FullName}\test.txt").ToList();
HtmlDocument doc = new HtmlDocument();
foreach (string url in allUrls)
{
    doc = new HtmlWeb().Load(url);
    Console.WriteLine(doc.DocumentNode.InnerHtml);
}
Please note, I am only printing the entire website; you can use HtmlAgilityPack to actually scrape the data you are interested in (like pulling all the links, or a specific class of element).
Read in the lines from the file.
Load the data from each URL using HtmlWeb.
Iterate through the documents and pull out what you need (see the sketch below).
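As an example of that last step, here is a rough sketch that pulls every link's href from each page instead of printing the whole document (the //a[@href] selector is just one possible target):

foreach (string url in allUrls)
{
    HtmlDocument pageDoc = new HtmlWeb().Load(url);
    HtmlNodeCollection linkNodes = pageDoc.DocumentNode.SelectNodes("//a[@href]");
    if (linkNodes == null) continue; // SelectNodes returns null when nothing matches
    foreach (HtmlNode link in linkNodes)
    {
        Console.WriteLine(link.GetAttributeValue("href", string.Empty));
    }
}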
Hello,
I want to know how I can scrape the text of an HTML site that is in a list (ul, li) on Windows Phone. I want to make an RSS feed reader. Please explain in detail, I am new to HtmlAgilityPack.
Thanks.
This is not as simple as you would think. You will have to use the Html Agility Pack to parse and normalize the HTML content, but then you will need to go through each node to assess whether it is a content node or not, i.e. you would want to ignore DIVs, embeds, etc.
I'll try to help you get started.
READ THE DOCUMENT
Uri url = new Uri("<your url>");
HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument document = web.Load(url.AbsoluteUri);
HERE IS HOW YOU CAN EXTRACT THE IMAGE AND TEXT NODES
var docNode = document.DocumentNode;
// If you just want all text within the document then life is simpler.
string htmlText = docNode.InnerText;
// Get images
IEnumerable<HtmlNode> imageNodes = docNode.Descendants("img");
// Now iterate through all the images and do what you like...
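Since the question is specifically about list items, here is a small sketch of the same idea applied to ul/li nodes (the selector is an assumption about the page's structure):

// Grab the text of every <li> inside a <ul>
HtmlNodeCollection listItems = docNode.SelectNodes("//ul/li");
if (listItems != null)
{
    foreach (HtmlNode li in listItems)
    {
        string itemText = li.InnerText.Trim();
        // Feed itemText into your reader's model here
    }
}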
If you want to implement a Readability/Instapaper like cleanup then download NReadability from https://github.com/marek-stoj/NReadability
How can I get the text in a div tag from a webpage into my .cs file (C#)?
I tested the Html Agility Pack but it did not work; I got various errors, probably because this is a Windows Phone 7 project. Does anyone else have any idea how to solve this?
Silverlight C# Code
string text = HtmlPage.Window.Invoke("getDivText").ToString();
HTML
function getDivText() {
return YourDivText;
}
HtmlAgilityPack should be what you need. Make sure you get it from NuGet rather than directly from the project page, as the NuGet version includes a WP7 build.
Update
Windows Phone does not support synchronous networking APIs, so HtmlAgilityPack can't support synchronous loads there. You need to pass a callback to LoadAsync to use it.
If you want to create the document from a string rather than an actual file, you should use:
doc.LoadHtml(htmlString);
EDIT
This is how I use HtmlAgilityPack for parsing a webpage (but this is in WinForms):
string page;
using (WebClient client = new WebClient())
{
    page = client.DownloadString(url);
}
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(page);

string result;
HtmlNode node = doc.DocumentNode.SelectSingleNode("//span[@class='obf']");
result = node.InnerText;
I have continually had problems with Html Agility Pack; my XPath queries only ever work when they are extremely simple:
//*[@id='some_id']
or
//input
However, any time they get more complicated, Html Agility Pack can't handle them.
Here's an example demonstrating the problem. I'm using WebDriver to navigate to Google and return the page source, which is passed to Html Agility Pack; then both WebDriver and Html Agility Pack attempt to locate the element/node (C#):
//The XPath query
const string xpath = "//form//tr[1]/td[1]//input[@name='q']";
//Navigate to Google and get page source
var driver = new FirefoxDriver(new FirefoxProfile()) { Url = "http://www.google.com" };
Thread.Sleep(2000);
//Can WebDriver find it?
var e = driver.FindElementByXPath(xpath);
Console.WriteLine(e!=null ? "Webdriver success" : "Webdriver failure");
//Can Html Agility Pack find it?
var source = driver.PageSource;
var htmlDoc = new HtmlDocument { OptionFixNestedTags = true };
htmlDoc.LoadHtml(source);
var nodes = htmlDoc.DocumentNode.SelectNodes(xpath);
Console.WriteLine(nodes!=null ? "Html Agility Pack success" : "Html Agility Pack failure");
driver.Quit();
In this case, WebDriver successfully located the item, but Html Agility Pack did not.
I know, I know, in this case it's very easy to change the XPath to one that will work: //input[@name='q'], but that only fixes this specific example, which isn't the point. I need something that will exactly, or at least closely, mirror the behavior of WebDriver's XPath engine, or even the FirePath or FireFinder add-ons for Firefox.
If WebDriver can find it, then why can't Html Agility Pack find it too?
The issue you're running into is with the FORM element. HTML Agility Pack handles that element differently - by default, it will never report that it has children.
In the particular example you gave, this query does find the target element:
.//div/div[2]/table/tr/td/table/tr/td/div/table/tr/td/div/div[2]/input
However, this does not, so it's clear the form element is tripping up the parser:
.//form/div/div[2]/table/tr/td/table/tr/td/div/table/tr/td/div/div[2]/input
That behavior is configurable, though. If you place this line prior to parsing the HTML, the form will give you child nodes:
HtmlNode.ElementsFlags.Remove("form");
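As a minimal sketch of how that fix slots into the snippet above (same source and query as before):

// Tell Html Agility Pack to treat <form> as a normal container element;
// by default it is parsed as empty, so its children are never reported.
HtmlNode.ElementsFlags.Remove("form");

var htmlDoc = new HtmlDocument { OptionFixNestedTags = true };
htmlDoc.LoadHtml(source);

// The original query that walks through <form> should now match.
var nodes = htmlDoc.DocumentNode.SelectNodes("//form//tr[1]/td[1]//input[@name='q']");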
I want to download all the images referenced in an HTML page. I don't know how many images there will be, and I don't want to use the HTML Agility Pack.
I searched Google, but all the sites just confused me more.
I tried a regex, but it only gave one result...
People are giving you the right answer - you can't be picky and lazy, too. ;-)
If you use a half-baked solution, you'll deal with a lot of edge cases. Here's a working sample that gets all links in an HTML document using HTML Agility Pack (it's included in the HTML Agility Pack download).
And here's a blog post that shows how to grab all images in an HTML document with HTML Agility Pack and LINQ
// Bing Image Result for Cat, First Page
string url = "http://www.bing.com/images/search?q=cat&go=&form=QB&qs=n";

// For speed of dev, I use a WebClient
WebClient client = new WebClient();
string html = client.DownloadString(url);

// Load the Html into the agility pack
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);

// Now, using LINQ to get all Images
List<HtmlNode> imageNodes = null;
imageNodes = (from HtmlNode node in doc.DocumentNode.SelectNodes("//img")
              where node.Name == "img"
                 && node.Attributes["class"] != null
                 && node.Attributes["class"].Value.StartsWith("img_")
              select node).ToList();

foreach (HtmlNode node in imageNodes)
{
    Console.WriteLine(node.Attributes["src"].Value);
}
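To actually save the images rather than just print their addresses, one possible follow-on step is (assuming the src values are absolute URLs; relative ones would first need to be resolved against the page address):

int i = 0;
foreach (HtmlNode node in imageNodes)
{
    string src = node.Attributes["src"].Value;
    // DownloadFile writes the response body straight to disk
    client.DownloadFile(src, $"image_{i++}.jpg");
}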
First of all I just can't leave this phrase alone:
images stored in html
That phrase is probably a big part of the reason your question was down-voted twice. Images are not stored in html. Html pages have references to images that web browsers download separately.
This means you need to do this in three steps: first download the html, then find the image references inside the html, and finally use those references to download the images themselves.
To accomplish this, look at the System.Net.WebClient class. It has a .DownloadString() method you can use to get the html. Then you need to find all the <img /> tags; you're on your own here, but it's straightforward enough. Finally, you use WebClient's .DownloadData() or .DownloadFile() methods to retrieve the images.
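A rough sketch of those three steps with WebClient and a naive regex might look like this (the page URL is a placeholder, and a regex like this will miss unusual markup, which is exactly why a real parser is the safer choice):

using System;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

class ImageDownloader
{
    static void Main()
    {
        string pageUrl = "http://example.com/page.html"; // placeholder page
        using (var client = new WebClient())
        {
            // Step 1: download the html
            string html = client.DownloadString(pageUrl);

            // Step 2: find the <img> references (naive; real html can break this)
            var matches = Regex.Matches(html, "<img[^>]+src=[\"']([^\"']+)[\"']", RegexOptions.IgnoreCase);

            // Step 3: download each referenced image
            foreach (Match m in matches)
            {
                var imageUri = new Uri(new Uri(pageUrl), m.Groups[1].Value); // resolve relative paths
                string fileName = Path.GetFileName(imageUri.LocalPath);
                client.DownloadFile(imageUri, fileName);
            }
        }
    }
}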
You can use a WebBrowser control and extract the HTML from that, e.g.
System.Windows.Forms.WebBrowser objWebBrowser = new System.Windows.Forms.WebBrowser();
objWebBrowser.DocumentCompleted += (sender, e) =>
{
    // The document is only available once navigation has completed
    System.Windows.Forms.HtmlDocument objDoc = objWebBrowser.Document;
    System.Windows.Forms.HtmlElementCollection aColl = objDoc.GetElementsByTagName("img");
    // ...
};
objWebBrowser.Navigate(new Uri("your url of html document"));
or directly invoke the IHTMLDocument family of COM interfaces
In general terms
You need to fetch the html page
Search for img tags and extract the src="..." portion out of them
Keep a list of all these extracted image urls.
Download them one by one.
Maybe this question about C# HTML parser will help you a little bit more.