How can I fetch/scrape HTML text and images on Windows Phone? - C#

Hello,
I want to know how I can scrape an HTML site's text that is in a list (ul, li) on Windows Phone. I want to make an RSS feed reader. Please explain in detail; I am new to HtmlAgilityPack.
Thanks.

This is not as simple as you would think. You will have to use HtmlAgilityPack to parse and normalize the HTML content, but then you will need to go through each node to assess whether it is a content node or not, i.e. you would want to ignore DIVs, embeds, etc.
I'll try to help you get started.
READ THE DOCUMENT
Uri url = new Uri("<your url>");
HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument document = web.Load(url.AbsoluteUri);
HERE IS HOW YOU CAN EXTRACT THE IMAGE AND TEXT TAGS
var docNode = document.DocumentNode;
// if you just want all text within the document then life is simpler.
string htmlText = docNode.InnerText;
// Get images
IEnumerable<HtmlNode> imageNodes = docNode.Descendants("img");
// Now iterate through all the images and do what you like...
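For instance, a minimal sketch of that loop, plus grabbing the li text the question asks about (the variable names here are just illustrative):
foreach (HtmlNode img in imageNodes)
{
    // src can be absent or relative; GetAttributeValue returns the fallback when it is missing
    string src = img.GetAttributeValue("src", null);
    // ... queue src for download ...
}
// For the feed-reader case: pull the text of every list item
foreach (HtmlNode li in docNode.Descendants("li"))
{
    string itemText = li.InnerText.Trim();
}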
If you want to implement a Readability/Instapaper-like cleanup, then download NReadability from https://github.com/marek-stoj/NReadability

Related

How to extract a specific line from a webpage in C#

string pageContent;
string display = null;
HttpWebRequest myReq = (HttpWebRequest)WebRequest.Create("https://www.google.com/search?q=" + "msg");
HttpWebResponse myres = (HttpWebResponse)myReq.GetResponse();
using (StreamReader sr = new StreamReader(myres.GetResponseStream()))
{
    pageContent = sr.ReadToEnd();
}
if (pageContent.Contains("find"))
{
    display = "done";
}
Currently what this code does is check whether "find" exists at a URL and set display to "done" if it is present.
What I want is to display the whole line or paragraph which contains "find".
So, instead of display = "done", I want to store the line which contains "find" in display.
HTML pages don't have lines. Whitespace outside tags is ignored, and an entire minified page may have no newlines at all. Even if it did, newlines are simply ignored even inside tags; that's why <br> is necessary. If you want to find a specific element you'll have to use an HTML parser like HtmlAgilityPack and identify the element using an XPath or CSS selector expression.
Copying from the landing page examples:
var url = $"https://www.google.com/search?q={msg}";
var web = new HtmlWeb();
var doc = web.Load(url);
var value = doc.DocumentNode
    .SelectNodes("//div[@id='center_col']")
    .First()
    .Attributes["value"].Value;
What you put in SelectNodes depends on what you want to find.
One way to test various expressions is to open the web page you want in a browser, open the browser's Developer Tools and start searching in the Element inspector. The search functionality there accepts XPath and CSS selectors.
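Applied to the goal here (storing the text of the element that contains "find"), a rough sketch would be the following; the choice of //p is just an assumption about the page's markup, and it needs a using System.Linq directive:
var web = new HtmlWeb();
var doc = web.Load($"https://www.google.com/search?q={msg}");
// Select every <p> whose text contains "find"; SelectNodes returns null when nothing matches
var matches = doc.DocumentNode.SelectNodes("//p[contains(., 'find')]");
string display = matches == null
    ? "not found"
    : string.Join(Environment.NewLine, matches.Select(n => n.InnerText.Trim()));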

Loading a file with HTML Agility Pack

I have a list of websites that's been generated and stored in a text file. Now I'm trying to load that file so I can repeat the process of extracting website URLs.
Every time I run that application, HtmlAgilityPack.HtmlDocument is the only thing that's populated in the console window.
private static async void GetHtmlAsync1()
{
    var doc = new HtmlDocument();
    doc.Load(FilenameHere);
    Console.WriteLine(doc);
}
Am I approaching this right?
Thanks
Console.WriteLine(doc) prints HtmlAgilityPack.HtmlDocument because HtmlDocument does not override ToString(), so you get the type name rather than the page content. Here is an example of loading a text file full of URLs and reading their content; my test file is in the same location as my project files.
List<string> allUrls = File.ReadAllLines($@"{Directory.GetParent(Environment.CurrentDirectory).Parent.Parent.FullName}\test.txt").ToList();
HtmlDocument doc = new HtmlDocument();
foreach (string url in allUrls)
{
    doc = new HtmlWeb().Load(url);
    Console.WriteLine(doc.DocumentNode.InnerHtml);
}
Please note, I am only printing the entire website; you can use HtmlAgilityPack to actually scrape the data you are interested in (like pulling all the links, or a specific class item), as sketched after the steps below.
Read in the lines from the file.
Load the data from each URL using HtmlWeb.
Iterate through each URL and get what you need.
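A minimal sketch of that link-pulling variant, assuming the same allUrls list as above (SelectNodes returns null when nothing matches, hence the guard):
foreach (string url in allUrls)
{
    HtmlDocument page = new HtmlWeb().Load(url);
    // SelectNodes returns null when there are no matches, so guard against it
    var anchors = page.DocumentNode.SelectNodes("//a[@href]");
    if (anchors == null) continue;
    foreach (HtmlNode a in anchors)
        Console.WriteLine(a.GetAttributeValue("href", ""));
}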

Download pictures from a specific website in C#

I want to download a lot of pictures from a specific website, but the pictures have different URLs (I mean they are not like something.com/picture1, then something.com/picture2). If it helps, I want to download from EA's FUT card database, but I have no idea how I should do this.
You can use the HTML Agility Pack to parse every <img> from the response and get the source attribute.
Then you can loop through the image tags and download the images via HttpClient, as you did with the webpage.
This would look something like this (response is the HTML returned by the web request):
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(response);
foreach (HtmlNode imageNode in doc.DocumentNode.SelectNodes("//img[@src]"))
{
    // Use imageNode.GetAttributeValue("src", null) to get the URL, then download the picture here
}
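The download step could then look something like this sketch using HttpClient, run inside an async method (the base address and the file-naming scheme are assumptions for illustration):
using (var client = new HttpClient())
{
    int i = 0;
    foreach (HtmlNode imageNode in doc.DocumentNode.SelectNodes("//img[@src]"))
    {
        // Resolve relative src values against the page's base address (assumed here)
        Uri imageUri = new Uri(new Uri("https://example.com/"), imageNode.GetAttributeValue("src", ""));
        byte[] bytes = await client.GetByteArrayAsync(imageUri);
        File.WriteAllBytes($"image{i++}.jpg", bytes);
    }
}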
Get more info about the HTML Agility Pack here:
http://html-agility-pack.net/

Get text from a <div> tag from an HTML page in C#

How can I get the text in a div tag from a webpage into my .cs file (C#)?
I tested the HTML Agility Pack but it did not work; I got various errors, probably because this is a Windows Phone 7 project. Does anyone else have any idea how to solve this?
Silverlight C# Code
string text = HtmlPage.Window.Invoke("getDivText").ToString();
HTML
function getDivText() {
    // "YourDiv" is a placeholder: use the id of the div whose text you want
    return document.getElementById("YourDiv").innerText;
}
HtmlAgilityPack should be what you need. Make sure you get it from NuGet, rather than directly from the project page, as the NuGet version includes a WP7 build.
Update
Windows Phone does not support synchronous networking APIs, so HtmlAgilityPack can't support synchronous loads there. You need to pass a callback to LoadAsync to use it.
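If the WP7 build's exact LoadAsync signature gives you trouble, an equivalent sketch using only WebClient's event-based async API (which Windows Phone supports) and HtmlDocument.LoadHtml would be:
var client = new WebClient();
client.DownloadStringCompleted += (s, e) =>
{
    if (e.Error != null) return; // handle the failure appropriately
    var doc = new HtmlDocument();
    doc.LoadHtml(e.Result);
    // e.g. read the div's text once the document is parsed ("YourDiv" is a placeholder id)
    var div = doc.DocumentNode.SelectSingleNode("//div[@id='YourDiv']");
    string text = div != null ? div.InnerText : null;
};
client.DownloadStringAsync(new Uri("http://example.com/"));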
If you want to create the document from a string rather than an actual file, you should use:
doc.LoadHtml(htmlString);
EDIT
This is how I use HtmlAgilityPack for parsing a webpage (but this is in WinForms):
string page;
using (WebClient client = new WebClient())
{
    page = client.DownloadString(url);
}
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(page);
string result;
HtmlNode node = doc.DocumentNode.SelectSingleNode("//span[@class='obf']");
result = node.InnerText;

C#: find images in HTML and download them

I want to download all the images stored in an HTML page, but I don't know how many images there will be, and I don't want to use HTML Agility Pack.
I searched on Google but every site just made me more confused.
I tried regex but got only one result...
People are giving you the right answer - you can't be picky and lazy, too. ;-)
If you use a half-baked solution, you'll deal with a lot of edge cases. Here's a working sample that gets all links in an HTML document using HTML Agility Pack (it's included in the HTML Agility Pack download).
And here's a blog post that shows how to grab all images in an HTML document with HTML Agility Pack and LINQ:
// Bing Image Result for Cat, First Page
string url = "http://www.bing.com/images/search?q=cat&go=&form=QB&qs=n";

// For speed of dev, I use a WebClient
WebClient client = new WebClient();
string html = client.DownloadString(url);

// Load the Html into the agility pack
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);

// Now, using LINQ to get all Images
List<HtmlNode> imageNodes = null;
imageNodes = (from HtmlNode node in doc.DocumentNode.SelectNodes("//img")
              where node.Name == "img"
                 && node.Attributes["class"] != null
                 && node.Attributes["class"].Value.StartsWith("img_")
              select node).ToList();

foreach (HtmlNode node in imageNodes)
{
    Console.WriteLine(node.Attributes["src"].Value);
}
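To actually download each image rather than just print its URL, the loop could be extended along these lines (the file-naming scheme is only an assumption):
int i = 0;
foreach (HtmlNode node in imageNodes)
{
    // Resolve the src against the page URL in case it is relative
    Uri imageUri = new Uri(new Uri(url), node.Attributes["src"].Value);
    client.DownloadFile(imageUri, $"cat{i++}.jpg");
}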
First of all I just can't leave this phrase alone:
images stored in html
That phrase is probably a big part of the reason your question was down-voted twice. Images are not stored in HTML. HTML pages have references to images that web browsers download separately.
This means you need to do this in three steps: first download the html, then find the image references inside the html, and finally use those references to download the images themselves.
To accomplish this, look at the System.Net.WebClient class. It has a .DownloadString() method you can use to get the HTML. Then you need to find all the <img /> tags. You're on your own here, but it's straightforward enough. Finally, you use WebClient's .DownloadData() or .DownloadFile() methods to retrieve the images.
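Since you'd rather not use HTML Agility Pack, here is a rough sketch of those three steps with WebClient and a regex (using System.Text.RegularExpressions). Be warned that a regex will miss edge cases a real parser handles, which is why everyone recommends one; the pattern below is deliberately simple:
using (var client = new WebClient())
{
    string html = client.DownloadString("http://example.com/");
    // Naive pattern: grabs the src attribute of each <img> tag
    MatchCollection matches = Regex.Matches(html, "<img[^>]+src\\s*=\\s*[\"']([^\"']+)[\"']", RegexOptions.IgnoreCase);
    int i = 0;
    foreach (Match m in matches)
    {
        // Resolve relative URLs against the page address
        Uri imageUri = new Uri(new Uri("http://example.com/"), m.Groups[1].Value);
        client.DownloadFile(imageUri, $"image{i++}.jpg");
    }
}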
You can use a WebBrowser control and extract the HTML from that, e.g.
System.Windows.Forms.WebBrowser objWebBrowser = new System.Windows.Forms.WebBrowser();
objWebBrowser.Navigate(new Uri("your url of html document"));
// Navigate is asynchronous: wait for the DocumentCompleted event before reading Document
System.Windows.Forms.HtmlDocument objDoc = objWebBrowser.Document;
System.Windows.Forms.HtmlElementCollection aColl = objDoc.GetElementsByTagName("IMG");
...
or directly invoke the IHTMLDocument family of COM interfaces
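Going back to the WebBrowser approach: Navigate returns before the page has loaded, so a working version hooks the DocumentCompleted event. A minimal sketch (this needs a Windows Forms message loop to run):
var browser = new System.Windows.Forms.WebBrowser();
browser.DocumentCompleted += (s, e) =>
{
    // The document is only safe to read once this event has fired
    var images = browser.Document.GetElementsByTagName("IMG");
    foreach (System.Windows.Forms.HtmlElement img in images)
        Console.WriteLine(img.GetAttribute("src"));
};
browser.Navigate(new Uri("your url of html document"));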
In general terms:
You need to fetch the HTML page.
Search for img tags and extract the src="..." portion out of them.
Keep a list of all these extracted image URLs.
Download them one by one.
Maybe this question about C# HTML parser will help you a little bit more.
