C#: find images in HTML and download them

I want to download all images referenced in an HTML page. I don't know in advance how many images there will be, and I don't want to use HTML Agility Pack.
I searched on Google, but every site just made me more confused.
I tried a regex, but it only returned one result.

People are giving you the right answer - you can't be picky and lazy, too. ;-)
If you use a half-baked solution, you'll deal with a lot of edge cases. Here's a working sample that gets all links in an HTML document using HTML Agility Pack (it's included in the HTML Agility Pack download).
And here's a blog post that shows how to grab all images in an HTML document with HTML Agility Pack and LINQ
// Bing Image Result for Cat, First Page
string url = "http://www.bing.com/images/search?q=cat&go=&form=QB&qs=n";
// For speed of dev, I use a WebClient
WebClient client = new WebClient();
string html = client.DownloadString(url);
// Load the Html into the agility pack
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
// Now, using LINQ to get all Images
List<HtmlNode> imageNodes =
    (from HtmlNode node in doc.DocumentNode.SelectNodes("//img")
     where node.Attributes["class"] != null
           && node.Attributes["class"].Value.StartsWith("img_")
     select node).ToList();
foreach (HtmlNode node in imageNodes)
{
    Console.WriteLine(node.Attributes["src"].Value);
}
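To actually download the files that loop prints, it could be extended along these lines (a sketch reusing the url, client, and imageNodes variables from above; the folder name and .jpg extension are assumptions):

```csharp
// Sketch: download each image found above into a local folder
string folder = "cats";                        // assumed folder name
System.IO.Directory.CreateDirectory(folder);

int count = 0;
foreach (HtmlNode node in imageNodes)
{
    string src = node.Attributes["src"].Value;
    // Resolve relative src values against the page URL
    Uri imageUri = new Uri(new Uri(url), src);
    string file = System.IO.Path.Combine(folder, "img" + (count++) + ".jpg");
    client.DownloadFile(imageUri, file);
}
```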

First of all I just can't leave this phrase alone:
images stored in html
That phrase is probably a big part of the reason your question was down-voted twice. Images are not stored in html. Html pages have references to images that web browsers download separately.
This means you need to do this in three steps: first download the html, then find the image references inside the html, and finally use those references to download the images themselves.
To accomplish this, look at the System.Net.WebClient class. It has a .DownloadString() method you can use to get the html. Then you need to find all the <img /> tags. You're on your own here, but it's straightforward enough. Finally, use WebClient's .DownloadData() or .DownloadFile() methods to retrieve the images themselves.
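A minimal sketch of those three steps, using a regex to pull out src attributes since the question rules out HTML Agility Pack (the URL and save folder are placeholders, and a regex like this will miss edge cases that a real parser handles):

```csharp
using System;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

class ImageDownloader
{
    static void Main()
    {
        string pageUrl = "http://example.com/";   // placeholder URL
        string saveDir = "images";                // placeholder folder
        Directory.CreateDirectory(saveDir);

        using (var client = new WebClient())
        {
            // Step 1: download the html
            string html = client.DownloadString(pageUrl);

            // Step 2: find the image references (naive regex; breaks on unusual markup)
            var matches = Regex.Matches(html,
                "<img[^>]+src\\s*=\\s*[\"']([^\"']+)[\"']",
                RegexOptions.IgnoreCase);

            // Step 3: download each image, resolving relative URLs against the page
            int i = 0;
            foreach (Match m in matches)
            {
                Uri imageUri = new Uri(new Uri(pageUrl), m.Groups[1].Value);
                string fileName = Path.Combine(saveDir,
                    "image" + (i++) + Path.GetExtension(imageUri.AbsolutePath));
                client.DownloadFile(imageUri, fileName);
            }
        }
    }
}
```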

You can use a WebBrowser control and extract the HTML from that (note that Navigate is asynchronous, so the document is only available after the DocumentCompleted event has fired), e.g.
System.Windows.Forms.WebBrowser objWebBrowser = new System.Windows.Forms.WebBrowser();
objWebBrowser.Navigate(new Uri("your url of html document"));
// wait for objWebBrowser.DocumentCompleted before reading the document
System.Windows.Forms.HtmlDocument objDoc = objWebBrowser.Document;
System.Windows.Forms.HtmlElementCollection aColl = objDoc.GetElementsByTagName("IMG");
...
or directly invoke the IHTMLDocument family of COM interfaces

In general terms
You need to fetch the html page
Search for img tags and extract the src="..." portion out of them
Keep a list of all these extracted image urls.
Download them one by one.
Maybe this question about C# HTML parser will help you a little bit more.

Related

Download pictures from a specific website in C#

I want to download a lot of pictures from a specific website, but the pictures have different URLs (I mean they are not like something.com/picture1, then something.com/picture2). If it helps, I want to download from EA's FUT card database, but I have no idea how I should do this.
You can use the HTML Agility Pack to parse every <img> from the response and get its src attribute.
Then you can loop through the image tags and download each image via HttpClient, as you did with the webpage.
This would look something like this (response is the html returned by the web-request):
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(response);
foreach (HtmlNode imageNode in doc.DocumentNode.SelectNodes("//img[@src]"))
{
    string src = imageNode.GetAttributeValue("src", null);
    // Use src to download the picture here
}
You can find more info about the Html Agility Pack here:
http://html-agility-pack.net/
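Putting the pieces together with HttpClient could look like this (a sketch; the page URL, folder name, and .png extension are assumptions, and relative src values are resolved against the page's URL):

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

class CardDownloader
{
    static async Task Main()
    {
        string pageUrl = "http://example.com/cards";   // placeholder URL
        Directory.CreateDirectory("cards");

        using (var http = new HttpClient())
        {
            // Download the page itself
            string response = await http.GetStringAsync(pageUrl);

            var doc = new HtmlDocument();
            doc.LoadHtml(response);

            // SelectNodes returns null when nothing matches, so guard against that
            var imageNodes = doc.DocumentNode.SelectNodes("//img[@src]");
            if (imageNodes == null)
                return;

            int i = 0;
            foreach (HtmlNode img in imageNodes)
            {
                // Resolve relative paths against the page URL
                var imageUri = new Uri(new Uri(pageUrl), img.GetAttributeValue("src", ""));
                byte[] bytes = await http.GetByteArrayAsync(imageUri);
                File.WriteAllBytes(Path.Combine("cards", "card" + (i++) + ".png"), bytes);
            }
        }
    }
}
```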

Getting nodes from html page using HtmlAgilityPack

My program collects info about Steam users' profiles (such as games, badges, etc.). I use HtmlAgilityPack to collect data from the html pages, and so far it has worked just fine for me.
The problem is that on some pages it works well, but on others it returns null nodes or throws an exception:
Object reference not set to an instance of an object
Here's an example.
This part works well (when I'm getting badges):
WebClient client = new WebClient();
string html = client.DownloadString("http://steamcommunity.com/profiles/*id*/badges/");
var doc = new HtmlDocument();
doc.LoadHtml(html);
HtmlNodeCollection div = doc.DocumentNode.SelectNodes("//div[@class=\"badge_row is_link\"]");
This returns the exact amount of badges, and then I can do whatever I want with them.
But in this one I do the exact same thing (but getting games), and somehow it keeps throwing the error I mentioned above:
WebClient client = new WebClient();
string html = client.DownloadString("http://steamcommunity.com/profiles/*id*/games/?tab=all");
var doc = new HtmlDocument();
doc.LoadHtml(html);
HtmlNodeCollection div = doc.DocumentNode.SelectNodes("//*[@id='game_33120']");
I know that the node is on the page (checked via Google Chrome's code view), and I don't know why it works in the 1st case but not in the 2nd.
When you right-click on the page and choose View Source do you still see an element with id='game_33120'? My guess is you won't. My guess is that the page is being built dynamically, client-side. Therefore, the HTML that comes down in the request doesn't contain the element you're looking for. Instead that element appears once the Javascript code has run in the browser.
It appears that the original response contains a section of Javascript with a variable called rgGames, which is a Javascript array of the games that will be rendered on the screen. You should be able to extract the information from that.
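One way to pull that array out is a regex over the raw html (a sketch; the exact `var rgGames = [...]` assignment format on Steam's page is an assumption, so the pattern may need adjusting, and the lazy match would stop early if the array contained a nested `];`):

```csharp
using System;
using System.Text.RegularExpressions;

class RgGamesExtractor
{
    // Extracts the JSON array assigned to rgGames, or null if not found
    public static string ExtractRgGames(string html)
    {
        Match m = Regex.Match(html, @"var\s+rgGames\s*=\s*(\[.*?\])\s*;",
                              RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value : null;
    }

    static void Main()
    {
        // Tiny stand-in for the real page source
        string html = "<script>var rgGames = [{\"appid\":33120}];</script>";
        Console.WriteLine(ExtractRgGames(html));   // the JSON array can then be parsed
    }
}
```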
I don't understand the SelectNodes call with the parameter "//*[@id='game_33120']" - maybe that is the problem - but you can also check this:
The real link of a Steam profile with badges etc. is:
http://steamcommunity.com/id/id/badges/
and not
http://steamcommunity.com/profiles/id/badges/
After I visited a badges page the url stayed in the browser, but at the games link they redirect you to
http://steamcommunity.com
Maybe this can help you.

How Can I fetch/scrape HTML text and images to Windows phone?

Hello,
I want to know how I can scrape an HTML site's text that is in a list (ul, li) on Windows Phone. I want to make an RSS feed reader. Please explain in detail, I am new to HtmlAgilityPack.
Thanks.
This is not as simple as you would think. You will have to use HtmlAgilityPack to parse and normalize the HTML content, but then you will need to go through each node to assess whether it is a content node or not, i.e. you would want to ignore DIVs, embeds, etc.
I'll try to help you get started.
READ THE DOCUMENT
Uri url = new Uri("<your url>");
HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument document = web.Load(url.AbsoluteUri);
HERE IS HOW YOU CAN EXTRACT THE IMAGE AND TEXT TAGS
var docNode = document.DocumentNode;
// If you just want all the text within the document then life is simpler:
string htmlText = docNode.InnerText;
// Get images
IEnumerable<HtmlNode> imageNodes = docNode.Descendants("img");
// Now iterate through all the images and do what you like...
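Since the question asks specifically about ul/li text, here is a similar sketch for list items, made self-contained with an inline html string instead of a web load (it still assumes the HtmlAgilityPack package is referenced):

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class ListScraper
{
    static void Main()
    {
        // Stand-in for the downloaded page
        string html = "<ul><li>First item</li><li>Second item</li></ul>";

        var document = new HtmlDocument();
        document.LoadHtml(html);

        // Collect the text of every <li> in the document
        var items = document.DocumentNode.Descendants("li")
                            .Select(li => li.InnerText.Trim())
                            .ToList();

        foreach (string item in items)
            Console.WriteLine(item);
    }
}
```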
If you want to implement a Readability/Instapaper like cleanup then download NReadability from https://github.com/marek-stoj/NReadability

WebBrowser Control c#: finding particular link and "clicking" it programmatically?

I am working on a CSV downloader project. I need to download the CSV files generated on a webpage, and using Html Agility Pack I found the exact link that contains the csv file:
Download file in csv format
Now I want the application, without any activity on my side, to detect this link in the web page (I could do that with HtmlAgilityPack) and download the file once the page has fully navigated in the WebBrowser control in my app. I tried an example from one of the SO posts, but I am getting:
Error: Object reference not set to an instance of an object.
HtmlElementCollection links = webBrowser.Document.GetElementsByTagName("A");
foreach (HtmlElement link in links) // this example is from another SO post
{
if (link.InnerText.Equals("My Assigned"))
link.InvokeMember("Click");
}
Can anybody suggest how to do it?
Solved:
I changed HtmlElementCollection links = webBrowser.Document.GetElementsByTagName("A"); to HtmlElementCollection links = webBrowser1.Document.Links and used
if (link.InnerText.Contains("My Assigned"))
{
    link.InvokeMember("Click");
}
Does anyone have a better solution?
InnerText might be null, so build in a safeguard to check for null:
if ((link.InnerText != null) && (link.InnerText.Equals("My Assigned")) )
link.InvokeMember("Click");
Actually, I would get rid of HtmlAgilityPack (it's pretty bad) and just loop through it yourself. Also, don't use InnerText, because based on your examples there doesn't seem to be an inner text in at least one of the links. Use the href attribute and check for the .csv extension.
link.href.EndsWith(".csv")
And if there is more than one .csv on each page, look for some url string or InnerText property to refine it.
Also, the reason your .GetElementsByTagName("A") call was throwing is most likely not the lookup itself - GetElementsByTagName does match elements by their HTML tag - but that webBrowser.Document was still null because the page had not finished loading, or that link.InnerText was null. Run the loop from the DocumentCompleted event and guard against nulls.
Also, how are you downloading the .csv file? Is a "download dialog" box coming up, or are you just showing it in the WebBrowser control? (Curious how you've handled that part.)
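A null-safe loop that combines these suggestions might look like this (a sketch assuming a WinForms form with a webBrowser1 control; the save path is a placeholder):

```csharp
// Sketch: run this from the WebBrowser's DocumentCompleted event handler
private void webBrowser1_DocumentCompleted(object sender,
    System.Windows.Forms.WebBrowserDocumentCompletedEventArgs e)
{
    if (webBrowser1.Document == null)
        return;

    foreach (System.Windows.Forms.HtmlElement link in webBrowser1.Document.Links)
    {
        string href = link.GetAttribute("href");
        if (!string.IsNullOrEmpty(href) &&
            href.EndsWith(".csv", System.StringComparison.OrdinalIgnoreCase))
        {
            // Download directly instead of invoking a click
            using (var client = new System.Net.WebClient())
                client.DownloadFile(href, @"C:\temp\download.csv");  // placeholder path
            break;
        }
    }
}
```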

Get text from a <Div> tag from a html page to c#

How can I get the text of a div tag from a webpage into my .cs file (C#)?
I tested the Html Agility Pack but it did not work - I got different errors, probably because this is a Windows Phone 7 project. Does anyone else have any idea how to solve this?
Silverlight C# Code
string text = HtmlPage.Window.Invoke("getDivText").ToString();
HTML
function getDivText() {
return YourDivText;
}
HtmlAgilityPack should be what you need. Make sure you get it from the NuGet, rather than directly from the project page, as the NuGet version includes a WP7 build.
Update
Windows Phone does not support synchronous networking APIs, so HtmlAgilityPack can't load a page synchronously there. You need to pass a callback to LoadAsync to use it.
If you want to create the document from a string rather than an actual file, you should use:
doc.LoadHtml(string);
EDIT
This is how I use HtmlAgilityPack for parsing a webpage (but this is in WinForms):
string page;
using(WebClient client = new WebClient())
{
page = client.DownloadString(url);
}
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(page);
string result;
HtmlNode node = doc.DocumentNode.SelectSingleNode("//span[@class='obf']");
result = node.InnerText;
