Loading file with HTML agility pack - c#

I have a list of websites that has been generated and stored in a text file. Now I'm trying to load that file so I can repeat the process of extracting website URLs.
Every time I run the application, HtmlAgilityPack.HtmlDocument is the only thing printed to the console window.
private static async void GetHtmlAsync1()
{
    var doc = new HtmlDocument();
    doc.Load(FilenameHere);
    Console.WriteLine(doc);
}
Am I going about this right?
Thanks

This is an example of loading a text file full of URLs and reading their content. My test file is in the same location as my project files.
List<string> allUrls = File.ReadAllLines($@"{Directory.GetParent(Environment.CurrentDirectory).Parent.Parent.FullName}\test.txt").ToList();
HtmlDocument doc = new HtmlDocument();
foreach (string url in allUrls)
{
    doc = new HtmlWeb().Load(url);
    Console.WriteLine(doc.DocumentNode.InnerHtml);
}
Please note, I am only printing the entire page here; you can use HtmlAgilityPack to scrape the data you are actually interested in, like pulling all the links or a specific class item (see the sketch after these steps).
Read in the lines from the file.
Load the data from each URL using HtmlWeb.
Iterate through each URL and get what you need.
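For example, here is a minimal sketch that pulls all the links from each page instead of printing the whole document (the test.txt path handling is simplified here):
using System;
using System.IO;
using HtmlAgilityPack;

// Minimal sketch: print every href found on each page listed in test.txt
var web = new HtmlWeb();
foreach (string url in File.ReadAllLines("test.txt"))
{
    HtmlDocument doc = web.Load(url);
    // SelectNodes returns null when nothing matches, so guard against that
    var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
    if (anchors == null) continue;
    foreach (HtmlNode anchor in anchors)
    {
        Console.WriteLine(anchor.GetAttributeValue("href", ""));
    }
}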

Related

How to extract a specific line from a webpage in c#

string pageContent = "";
string display = "";
HttpWebRequest myReq = (HttpWebRequest)WebRequest.Create("https://www.google.com/search?q=" + "msg");
HttpWebResponse myres = (HttpWebResponse)myReq.GetResponse();
using (StreamReader sr = new StreamReader(myres.GetResponseStream()))
{
    pageContent = sr.ReadToEnd();
}
if (pageContent.Contains("find"))
{
    display = "done";
}
Currently this code checks whether "find" exists on a URL and displays "done" if it is present.
What I want instead is to display the whole line or paragraph which contains "find".
So instead of display = "done", I want to store the line which contains "find" in display.
HTML pages don't have lines. Whitespace outside tags is ignored, and an entire minified page may have no newlines at all. Even if it did, newlines are simply ignored, even inside tags. That's why <br> is necessary. If you want to find a specific element you'll have to use an HTML parser like HTMLAgilityPack and identify the element using an XPath or CSS selector expression.
Copying from the landing page examples:
var url = $"https://www.google.com/search?q={msg}" ;
var web = new HtmlWeb();
var doc = web.Load(url);
var value = doc.DocumentNode
.SelectNodes("//div[#id='center_col']")
.First()
.Attributes["value"].Value;
What you put in SelectNodes depends on what you want to find.
One way to test various expressions is to open the web page you want in a browser, open the browser's Developer Tools and start searching in the Element inspector. The search functionality there accepts XPath and CSS selectors.
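Applied to the original question, here is a rough sketch (not tested against Google's actual markup, which changes often and may block scraping) that grabs the text of the first element whose own text contains "find":
using System.Linq;
using HtmlAgilityPack;

var web = new HtmlWeb();
var doc = web.Load($"https://www.google.com/search?q={msg}");

// contains(text(), ...) matches elements whose own text includes the word
var node = doc.DocumentNode
    .SelectNodes("//*[contains(text(), 'find')]")
    ?.FirstOrDefault();

// Fall back to "not found" when no element matches
string display = node?.InnerText ?? "not found";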

Fetching data from a web page to a C# application

I am trying to create a desktop application in C# that will retrieve data from a website. In short, this is an application that I will use to create statistics for my local league's fantasy football (soccer) game. All the data I want to use is freely available online, but there are no APIs available to retrieve the data.
The first thing I tried was to get the HTML code for the website using WebClient and DownloadString:
WebClient client = new WebClient();
string priceChangeString = client.DownloadString(url);
However, it turned out that the data is not in the HTML string.
If I use Developer Tools in Chrome I can inspect the page under "Elements". There I can see the data I want:
Screenshot from Chrome Developer Tools
I have tried to get these values by using "Copy as XPath" and HtmlAgilityPack, but I can't get this to work in my code:
using HtmlAgilityPack;

string url = "https://fantasy.eliteserien.no/a/statistics/cost_change_start";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
string userscore = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[2]/div[1]/div/div[2]/a/span[1]")[0].InnerText;
I have tried several variations of this code, but they all return a NullReferenceException:
Unhandled Exception: System.NullReferenceException: Object reference not set to an instance of an object.
at FantasyTest.Program.Main(String[] args) in C:\Users\my_username\source\repos\FantasyTest\FantasyTest\Program.cs:line 27
Does anyone see what I'm doing wrong when I try to use HtmlAgilityPack and XPath? Are there any other approaches I can take to solve this?
The web page from this example can be found here
I used a list to store all the information, and then searched through that list for a given element, for example <span>; within all the <span>s the application searches for class="card-list".
var url = "https://fantasy.eliteserien.no/a/statistics/cost_change_start";
var httpClient = new HttpClient();
var html = await httpClient.GetStringAsync(url);
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(html);
//This is the part of the code that takes information from the website
//Note that this part matches your screenshot, in the HTML code
//You can use that there is a table with class="ism-table ism-table--el"
//This piece of code target that specific table
var ProductsHtml = htmlDocument.DocumentNode.Descendants("table")
.Where(node => node.GetAttributeValue("class", "")
.Equals("ism-table ism-table--el")).ToList(); ;
try{
var ProductListItems = ProductsHtml[0].Descendants("tr")
foreach (var ProductListItem in ProductListItems)
{
//This targets whats inside the table
Console.WriteLine("Id: " +
ProductListItem.Descendants("<HEADER>")
.Where(node => node.GetAttributeValue("<CLASS>", "")
.Equals("<CLASS=>")).FirstOrDefault().InnerText
);
}
In your case I think you need a regex to match the numbers. This site has the numbers in <td>number</td> format, and what we need is <td class="mNOK">number</td>.
So you need to use a regex to match all the numbers. To do that we do:
using System.Text.RegularExpressions;

// Regex: match numbers in <td>
Console.WriteLine("Numbers: " +
    Regex.Match(ProductListItem.Descendants("td").FirstOrDefault().InnerText,
        @"[0-9]+").Value
);
Note that you need to change <URL>, <HEADER>, <CLASS> and <CLASS=>:
<URL>: the site you want to take information from.
<HEADER>: which element inside the HTML code you want to target, for example "span, div, li, ul".
<CLASS>: inside that element, which attribute you want to look for, for example "id, name".
<CLASS=>: what the <CLASS> attribute needs to equal for the inner text to be read.
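As a concrete illustration, here is the same template filled in for the <td class="mNOK"> cells mentioned above (assuming that class actually appears in the table you are targeting):
// Filled-in template: target <td class="mNOK"> inside each row
var cell = ProductListItem.Descendants("td")
    .Where(node => node.GetAttributeValue("class", "")
        .Equals("mNOK"))
    .FirstOrDefault();

if (cell != null)
{
    Console.WriteLine("Number: " + cell.InnerText);
}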
If you don't mind calling an external Python program, I'd suggest looking at Python and the library called "BeautifulSoup". It parses HTML nicely. Have the Python program write out an XML file that your application can deserialize; the C# program can then do whatever it needs to do using that deserialized structure.
Thank you all for the feedback on this post, it has helped me find a solution to this problem.
It turned out that the data I wanted to retrieve was loaded with JavaScript. This means that HtmlWeb and HtmlDocument from HtmlAgilityPack load the HTML before the data I want has been added to the page, so they cannot be used for this purpose.
I got around this by using a headless browser. I downloaded ChromeDriver and Selenium via NuGet, and got the data I wanted by using the following code:
using OpenQA.Selenium.Chrome;

var chromeOptions = new ChromeOptions();
chromeOptions.AddArguments("headless");
using (var driver = new ChromeDriver(chromeOptions))
{
    driver.Navigate().GoToUrl("https://fantasy.eliteserien.no/a/statistics/cost_change_start");
    // As IWebElement
    var fantasyTable = driver.FindElementByClassName("ism-scroll-table");
    // Content as text-string
    string fantasyTableText = fantasyTable.Text;
    // As Html-string
    string fantasyTableAsHtml = fantasyTable.GetAttribute("innerHTML");
    // My code for handling the data follows here...
}
Resource used to solve this:
How to start ChromeDriver in headless mode
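If you prefer to keep using HtmlAgilityPack for the actual parsing, one option (a sketch, building on the fantasyTableAsHtml string from the code above) is to feed the rendered markup back into an HtmlDocument:
using HtmlAgilityPack;

// Parse the table HTML that Selenium rendered
var tableDoc = new HtmlDocument();
tableDoc.LoadHtml(fantasyTableAsHtml);

// Each row of the rendered table is now queryable with XPath
var rows = tableDoc.DocumentNode.SelectNodes("//tr");
if (rows != null)
{
    foreach (var row in rows)
    {
        Console.WriteLine(row.InnerText.Trim());
    }
}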

Download pictures from a specific website in C#

I want to download a lot of pictures from a specific website, but the pictures have different URLs (I mean they are not like something.com/picture1 then something.com/picture2). If it helps, I want to download from EA's FUT card database, but I have no idea how I should do this.
You can use the HTML Agility Pack to parse every <img> from the response and get its source attribute.
Then you can loop through the image tags and download each image via HttpClient, as you did with the webpage.
This would look something like this (response is the HTML returned by the web request):
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(response);
foreach (HtmlNode img in doc.DocumentNode.SelectNodes("//img[@src]"))
{
    string src = img.GetAttributeValue("src", "");
    // Use src to download the picture here
}
Get more info about the Html Agility Pack here:
http://html-agility-pack.net/
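Here is a minimal sketch of that download step, assuming the code runs inside an async method and that the src values are absolute URLs (relative ones would first need to be resolved against the page's base address):
using System;
using System.IO;
using System.Net.Http;
using HtmlAgilityPack;

var httpClient = new HttpClient();
var doc = new HtmlDocument();
doc.LoadHtml(response); // response is the HTML returned by the web request

var images = doc.DocumentNode.SelectNodes("//img[@src]");
if (images != null)
{
    foreach (HtmlNode img in images)
    {
        string src = img.GetAttributeValue("src", "");
        byte[] bytes = await httpClient.GetByteArrayAsync(src);
        // Derive a local file name from the URL and save the bytes
        string fileName = Path.GetFileName(new Uri(src).LocalPath);
        File.WriteAllBytes(fileName, bytes);
    }
}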

How to save string content to a local file

I am developing an application which shows web pages through a web browser control.
When I click the save button, the web page with images should be stored in local storage. It should be saved in .html format.
I have the following code:
WebRequest request = WebRequest.Create(txtURL.Text);
WebResponse response = request.GetResponse();
Stream data = response.GetResponseStream();
string html = String.Empty;
using (StreamReader sr = new StreamReader(data))
{
html = sr.ReadToEnd();
}
Now the string html contains the webpage content. I need to save this into D:\Cache\.
How do I save the HTML contents to disk?
You can use this code to write your HTML string to a file:
var path = @"D:\Cache\myfile.html";
File.WriteAllText(path, html);
Further refinement: Extract the filename from your (textual) URL.
Update:
See Get file name from URI string in C# for details. The idea is:
var uri = new Uri(txtUrl.Text);
var filename = uri.IsFile
? System.IO.Path.GetFileName(uri.LocalPath)
: "unknown-file.html";
You have to write the below code in the save button handler:
File.WriteAllText(path, browser.Document.Body.Parent.OuterHtml, Encoding.GetEncoding(browser.Document.Encoding));
Using 'Body.Parent' saves the whole page instead of only a part of it.
Check it.
There is nothing built-in to the .NET Framework as far as I know.
So my approach would be like below:
1. Use System.Net.HttpWebRequest to get the main HTML document as a string or stream (easy). (Which you have done already.)
2. Load this into an HTMLAgilityPack document where you can now easily query the document to get lists of all image elements, stylesheet links, etc.
3. Then make a separate web request for each of these files and save them to a subdirectory.
4. Finally update all relevant links in the main page to point to the items in the subdirectory (a rough sketch of steps 2-4 follows below).
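Here is that sketch, handling just the images (stylesheets and scripts would follow the same pattern; the D:\Cache paths come from the question, the remaining names are illustrative, and the code assumes it runs inside an async method):
using System;
using System.IO;
using System.Net.Http;
using HtmlAgilityPack;

var httpClient = new HttpClient();
var doc = new HtmlDocument();
doc.LoadHtml(html); // the HTML string you already downloaded

Directory.CreateDirectory(@"D:\Cache\assets");

var images = doc.DocumentNode.SelectNodes("//img[@src]");
if (images != null)
{
    foreach (HtmlNode img in images)
    {
        // Resolve relative image URLs against the page address
        var src = new Uri(new Uri(txtURL.Text), img.GetAttributeValue("src", ""));
        string localName = Path.GetFileName(src.LocalPath);
        File.WriteAllBytes(Path.Combine(@"D:\Cache\assets", localName),
            await httpClient.GetByteArrayAsync(src));
        // Point the link at the saved copy
        img.SetAttributeValue("src", "assets/" + localName);
    }
}

doc.Save(@"D:\Cache\page.html");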

How can I fetch/scrape HTML text and images to Windows Phone?

Hello,
I want to know how I can scrape an HTML site's text that is in a list (ul, li) on Windows Phone. I want to make an RSS feed reader. Please explain in detail, I am new to HTMLAgilityPack.
Thanks.
This is not as simple as you would think. You will have to use HTMLAgilityPack to parse and normalize the HTML content, but then you will need to go through each node to assess whether it's a content node or not, i.e. you would want to ignore DIVs, embeds, etc.
I'll try to help you get started.
READ THE DOCUMENT
Uri url = new Uri(<Your url>);
HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument document = web.Load(url.AbsoluteUri);
HERE IS HOW YOU CAN EXTRACT THE IMAGE AND TEXT TAGS
var docNode = document.DocumentNode;
// If you just want all text within the document then life is simpler.
string htmlText = docNode.InnerText;
// Get images
IEnumerable<HtmlAgilityPack.HtmlNode> imageNodes = docNode.Descendants("img");
// Now iterate through all the images and do what you like...
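Continuing the snippet above, here is a sketch for the ul/li text the question asks about (the feed-reader binding is left as a placeholder):
// Grab the text of every list item on the page
foreach (HtmlAgilityPack.HtmlNode li in docNode.Descendants("li"))
{
    string itemText = li.InnerText.Trim();
    // ... bind itemText to your feed reader UI here
}

// And the source URL of each image found above
foreach (HtmlAgilityPack.HtmlNode img in imageNodes)
{
    string src = img.GetAttributeValue("src", "");
    // ... download or display the image here
}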
If you want to implement a Readability/Instapaper like cleanup then download NReadability from https://github.com/marek-stoj/NReadability
