I created a Console Application that uses Selenium to get the text from a table.
I tried this code:
IList<IWebElement> tableRows = browser.FindElementsByXPath("id('column2')/tbody/tr");
var doc = new HtmlDocument();
doc.LoadHtml(tableRows);
I get this error:
'HtmlDocument' does not contain a constructor that takes 0 arguments
I read the answers to a related question. Most people on Stack Overflow seem to be able to write new HtmlDocument() without any problem. Why can't I use it here? I also tried it in a WinForms application, but I still can't use HtmlDocument.
Also, HtmlDocument only seems to have LoadHtml(String), but my code gives me an IList<IWebElement>. I don't know how to convert that into an HTML string to pass to doc.
IWebElement table = browser.FindElement(By.Id("column2"));
var doc = new HtmlDocument();
doc.LoadHtml(table.GetAttribute("innerHTML"));
First of all, you can get the table elements using Selenium. If you choose to use Agility Pack, you need to pass the LoadHtml method a string containing the HTML source. So what you need to do is find the HTML block (in your case the table), take it as an IWebElement, and pass its inner HTML to LoadHtml using table.GetAttribute("innerHTML").
You can also pass the full page source: doc.LoadHtml(driver.PageSource);
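A minimal sketch of that approach (the URL is a placeholder, and the table id column2 is taken from the question; HtmlDocument is fully qualified in case another HtmlDocument type, such as System.Windows.Forms.HtmlDocument, is also in scope, which would explain the "no constructor that takes 0 arguments" error):

using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

using (IWebDriver driver = new ChromeDriver())
{
    driver.Navigate().GoToUrl("http://example.com/page-with-table"); // placeholder URL

    // Grab the table as one element and hand its markup to Agility Pack.
    IWebElement table = driver.FindElement(By.Id("column2"));
    string tableHtml = table.GetAttribute("innerHTML");

    // Fully qualified: System.Windows.Forms.HtmlDocument has no parameterless constructor.
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(tableHtml);

    // The rows can now be queried with XPath.
    var rows = doc.DocumentNode.SelectNodes("//tr");
    if (rows != null)
    {
        foreach (var row in rows)
            Console.WriteLine(row.InnerText.Trim());
    }
}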
HttpWebRequest myReq = (HttpWebRequest)WebRequest.Create("https://www.google.com/search?q=" + "msg");
HttpWebResponse myres = (HttpWebResponse)myReq.GetResponse();
using (StreamReader sr = new StreamReader(myres.GetResponseStream()))
{
    pageContent = sr.ReadToEnd();
}
if (pageContent.Contains("find"))
{
    display = "done";
}
Currently this code checks whether "find" exists on a URL and sets display to "done" if it is present.
What I want instead is to get the whole line or paragraph that contains "find".
So instead of display = "done", I want to store the line that contains "find" in display.
HTML pages don't have lines. Whitespace outside tags is ignored, and an entire minified page may have no newlines at all. Even if it did have them, newlines are ignored even inside tags; that's why <br> is necessary. If you want to find a specific element, you'll have to use an HTML parser like HtmlAgilityPack and identify the element using an XPath or CSS selector expression.
Copying from the landing page examples:
var url = $"https://www.google.com/search?q={msg}" ;
var web = new HtmlWeb();
var doc = web.Load(url);
var value = doc.DocumentNode
.SelectNodes("//div[#id='center_col']")
.First()
.Attributes["value"].Value;
What you put in SelectNodes depends on what you want to find.
One way to test various expressions is to open the web page you want in a browser, open the browser's Developer Tools and start searching in the Element inspector. The search functionality there accepts XPath and CSS selectors.
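For example, a minimal sketch that loads the page and pulls out the text of the first element whose own text contains "find" (the XPath is an illustration; the exact expression depends on the page's markup, and msg is assumed to hold the search term as in the question):

using System;
using HtmlAgilityPack;

var msg = "something"; // placeholder search term, as in the question
var web = new HtmlWeb();
var doc = web.Load($"https://www.google.com/search?q={msg}");

// First element whose own text contains the word "find".
var node = doc.DocumentNode.SelectSingleNode("//*[contains(text(), 'find')]");

string display = node != null ? node.InnerText.Trim() : "not found";
Console.WriteLine(display);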
I am using Fizzler / HtmlAgilityPack to parse and extract elements from an ASP.NET page file. In the ASP.NET file we also use Telerik controls, e.g.
<telerik:RadGrid ... >
To extract this element I used the methods below, but without success. Can someone help with this, please?
method#1:
HtmlDocument document = .....;
document.SelectNodes("telerik:RadGrid");
and it throws an exception.
Then I tried method#2:
XPathNavigator navigator = document.CreateNavigator();
var manager = new XmlNamespaceManager(navigator.NameTable);
manager.AddNamespace("telerik", "http://www.telerik.com");
var expr = XPathExpression.Compile("RadGrid");
expr.SetContext(manager);
var grids = document.DocumentNode.SelectNodes(expr);
This time there is no exception, but grids is null even though the ASP.NET page contains telerik:RadGrid markup.
It could be that your XPath is incorrect.
Please try //*[name()='telerik:RadGrid'] instead; matching on name() works for elements that have an XML namespace prefix.
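A minimal sketch of that suggestion (the file path is a placeholder, and it assumes the .aspx markup is loaded into an HtmlAgilityPack document):

using System;
using HtmlAgilityPack;

var document = new HtmlDocument();
document.Load("WebForm1.aspx"); // placeholder path to the ASP.NET markup file

// Match on the full qualified tag name instead of registering a namespace.
var grids = document.DocumentNode.SelectNodes("//*[name()='telerik:RadGrid']");

// If this returns null, try the lower-cased name 'telerik:radgrid';
// HtmlAgilityPack may normalize element names to lower case.
Console.WriteLine(grids != null ? $"Found {grids.Count} RadGrid(s)" : "No RadGrid found");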
I wanted to make a program that reads the description of a picture album over at imgur.com (this one, for example: https://imgur.com/gallery/DsAE9cv).
The element would be
<div class="post-image-description">One owner?</div>
but I'm having a hard time getting the description ("One owner?").
It would be very helpful to get some tips!
I tried using HtmlAgilityPack and an XPath, but it's not working.
string link = txt_Link.Text;
var web = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load(link);
var description = doc.DocumentNode.SelectSingleNode("/html[1]/body[1]/div[8]/div[2]/div[2]/div[2]/div[1]/div[2]/p[1]");
txt_Return.Text = description.ToString();
I expected the output "One owner?", but the node is either null or the textbox just shows "HtmlAgilityPack.HtmlNode".
description.ToString() does not return the expected result.
Use the description.InnerText property instead to get the text.
It returns "One owner?" in your example.
Try using some online XPath tester tool, like http://xpather.com/
You might try this XPath to get the result you need:
//p[@class='post-image-description']/text()
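Put together, a minimal sketch (it uses the div shown in the question; the //p variant above works the same way if imgur renders the description in a <p>):

using System;
using HtmlAgilityPack;

var web = new HtmlWeb();
var doc = web.Load("https://imgur.com/gallery/DsAE9cv");

// Match on the class rather than a long absolute path, which breaks easily.
var description = doc.DocumentNode
    .SelectSingleNode("//div[@class='post-image-description']");

Console.WriteLine(description != null ? description.InnerText.Trim() : "description not found");

If SelectSingleNode returns null even with a correct XPath, the description may be injected by JavaScript, in which case HtmlWeb alone will not see it.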
I am trying to create a desktop application in C# that will retrieve data from a website. In short, this is an application that I will use to create statistics for my local league's fantasy football (soccer) game. All the data I want to use is freely available online, but there are no APIs available to retrieve the data.
The first thing I tried was to get the HTML code for the website using WebClient and DownloadString:
WebClient client = new WebClient();
string priceChangeString = client.DownloadString(url);
However, it turned out that the data is not in the HTML string.
If I use Developer Tools in Chrome I can inspect the page under "elements". Here I see that the data I want:
Screenshot from Chrome Developer Tools
I have tried to get these values by using "Copy as XPath" and HtmlAgilityPack, but I can't get this to work. My code:
using HtmlAgilityPack;
string url = "https://fantasy.eliteserien.no/a/statistics/cost_change_start";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
string userscore = doc.DocumentNode.SelectNodes("//*[#id=\"main\"]/div[3]/div/div[2]/div[1]/div[2]/div[1]/div/div[2]/a/span[1]")[0].InnerText;
I have tried several variations of this code, but they all return a NullReferenceException:
Unhandled Exception: System.NullReferenceException: Object reference not set to an instance of an object.
at FantasyTest.Program.Main(String[] args) in C:\Users\my_username\source\repos\FantasyTest\FantasyTest\Program.cs:line 27
Does anyone see what I'm doing wrong when I try to use HtmlAgilityPack and XPath? Are there any other approaches I can take to solve this?
The web page from this example can be found here
I stored all the information in a list and then searched through that list for a given element, for example <span>, and within those <span> elements I had the application look for class="card-list".
var url = "https://fantasy.eliteserien.no/a/statistics/cost_change_start";
var httpClient = new HttpClient();
var html = await httpClient.GetStringAsync(url);
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(html);
//This is the part of the code that takes information from the website.
//Note that this matches your screenshot: in the HTML there is a table
//with class="ism-table ism-table--el", and this piece of code
//targets that specific table.
var ProductsHtml = htmlDocument.DocumentNode.Descendants("table")
    .Where(node => node.GetAttributeValue("class", "")
    .Equals("ism-table ism-table--el")).ToList();
try
{
    var ProductListItems = ProductsHtml[0].Descendants("tr");
    foreach (var ProductListItem in ProductListItems)
    {
        //This targets what is inside the table
        Console.WriteLine("Id: " +
            ProductListItem.Descendants("<HEADER>")
            .Where(node => node.GetAttributeValue("<CLASS>", "")
            .Equals("<CLASS=>")).FirstOrDefault().InnerText
        );
    }
}
catch (Exception ex)
{
    Console.WriteLine(ex.Message);
}
In your case I think you need a regex to match the numbers. This site has the numbers in <td>number</td> format; what we need are the <td class="mNOK">number</td> cells.
So you need a regex that matches the numbers. To do that:
//Regex: match the numbers inside the <td>
Console.WriteLine("numbers: " +
    Regex.Match(ProductListItem.Descendants("td").FirstOrDefault().InnerText,
        @"[0-9]+")
);
Note that you need to change <URL>, <HEADER>, <CLASS> and <CLASS=> (a filled-in sketch follows after this list):
<URL>: the site you want to take information from;
<HEADER>: which tag inside the HTML you want to target, for example span, div, li or ul;
<CLASS>: which attribute of that tag to look at, for example id or name;
<CLASS=>: what that attribute needs to equal for the inner text to be read.
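As an illustration only, here is the template filled in for the table from the question (the class name comes from the screenshot described above and should be treated as an assumption; note that, as the asker's own follow-up answer below explains, this particular table is rendered by JavaScript, so the snippet only finds it if the markup is present in the raw HTML):

using System;
using System.Linq;
using System.Net.Http;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

var url = "https://fantasy.eliteserien.no/a/statistics/cost_change_start";
var html = await new HttpClient().GetStringAsync(url);

var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(html);

// Assumed from the screenshot: the stats live in a table with class "ism-table ism-table--el".
var table = htmlDocument.DocumentNode.Descendants("table")
    .FirstOrDefault(node => node.GetAttributeValue("class", "") == "ism-table ism-table--el");

if (table != null)
{
    foreach (var row in table.Descendants("tr"))
    {
        var firstCell = row.Descendants("td").FirstOrDefault();
        if (firstCell == null) continue;

        // Pull the digits out of the first cell of each row.
        var number = Regex.Match(firstCell.InnerText, @"[0-9]+");
        Console.WriteLine("number: " + number.Value);
    }
}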
If you don't mind calling an external Python program, I'd suggest looking at Python and the BeautifulSoup library. It parses HTML nicely. Have the Python program write out an XML file that your application can deserialize... the C# program can then do whatever it needs to do using that deserialized structure.
Thank you all for the feedback on this post; it helped me find a solution to the problem.
It turned out that the data I wanted to retrieve is loaded with JavaScript. This means that HtmlWeb and HtmlDocument from HtmlAgilityPack fetch the HTML before the data I want has been loaded into the page, so they cannot be used for this purpose.
I got around this by using a headless browser. I downloaded ChromeDriver and Selenium via NuGet and got the data I wanted with the following code:
using OpenQA.Selenium.Chrome;
var chromeOptions = new ChromeOptions();
chromeOptions.AddArguments("headless");
using (var driver = new ChromeDriver(chromeOptions))
{
driver.Navigate().GoToUrl("https://fantasy.eliteserien.no/a/statistics/cost_change_start");
// As IWebElement
var fantasyTable = driver.FindElementByClassName("ism-scroll-table");
// Content as text-string
string fantasyTableText = fantasyTable.Text;
// As Html-string
string fantasyTableAsHtml = fantasyTable.GetAttribute("innerHTML");
// My code for handling the data follows here...
}
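One possible way to handle the extracted HTML afterwards (a sketch only; the original data-handling code is not shown) is to feed it into HtmlAgilityPack, since fantasyTableAsHtml is now a plain HTML string:

using System;
using System.Linq;

// Parse the table HTML that Selenium extracted above.
var tableDoc = new HtmlAgilityPack.HtmlDocument();
tableDoc.LoadHtml(fantasyTableAsHtml);

// Print each row as a line of cell values; the exact columns depend on the site.
var rows = tableDoc.DocumentNode.SelectNodes("//tr");
if (rows != null)
{
    foreach (var row in rows)
    {
        var cells = row.SelectNodes("td|th");
        if (cells != null)
            Console.WriteLine(string.Join(" | ", cells.Select(c => c.InnerText.Trim())));
    }
}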
Resource used to solve this:
How to start ChromeDriver in headless mode
I'm new to C#, and I'm trying to access an element from a website using webBrowser. I was wondering how I can get the "Developers" string from the site:
<div id="title" style="display: block;">
<b>Title:</b> **Developers**
</div>
I tried to use webBrowser1.Document.GetElementById("title"), but I have no idea how to keep going from here.
Thanks :)
You can download the source code using the WebClient class,
then look within the file for <b>Title:</b>**Developers**</div> and omit everything besides "Developers".
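A minimal sketch of that string-based approach (the URL is a placeholder and the markers come from the snippet in the question; an HTML parser, as suggested below, is usually more robust):

using System;
using System.Net;

using (var client = new WebClient())
{
    // Placeholder URL: replace with the page that contains the "title" div.
    string html = client.DownloadString("http://example.com/page.html");

    const string marker = "<b>Title:</b>";
    int start = html.IndexOf(marker, StringComparison.OrdinalIgnoreCase);
    if (start >= 0)
    {
        start += marker.Length;
        int end = html.IndexOf("</div>", start, StringComparison.OrdinalIgnoreCase);
        if (end > start)
        {
            string developers = html.Substring(start, end - start).Trim();
            Console.WriteLine(developers); // "Developers"
        }
    }
}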
HtmlAgilityPack and CsQuery are the routes many people have taken to work with HTML pages in .NET, and I'd recommend them too.
But in case your task is limited to this simple requirement, and you have <div> markup that is valid XHTML (like the sample you posted), then you can treat it as XML. That means you can use native .NET APIs such as XDocument or XmlDocument to parse the HTML and run an XPath query to get a specific part of it, for example:
var xml = @"<div id=""title"" style=""display: block;""> <b>Title:</b> Developers</div>";
//or according to your code snippet, you may be able to do as follow :
//var xml = webBrowser1.Document.GetElementById("title").OuterHtml;
var doc = new XmlDocument();
doc.LoadXml(xml);
var text = doc.DocumentElement.SelectSingleNode("//div/b/following-sibling::text()");
Console.WriteLine(text.InnerText);
//above prints " Developers"
The XPath above selects the text node ("Developers") next to the <b> node.
You can use HtmlAgilityPack (as mentioned by Giannis, http://htmlagilitypack.codeplex.com/). Using a web browser control is too much for this task:
HtmlAgilityPack.HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("http://www.google.com");
var el = doc.GetElementbyId("title");
string s = el.InnerHtml; // get the : <b>Title:</b> **Developers**
I haven't tried this code but it should be very close to working.
There must be an InnerText in HtmlAgilityPack as well, allowing you to do this:
string s = el.InnerText; // get the : Title: **Developers**
You can also remove the Title: by removing the appropriate node:
el.SelectSingleNode("//b").Remove();
string s = el.InnerText; // get the : **Developers**
If for some reason you want to stick to the web browser control, I think you can do this:
var el = webBrowser1.Document.GetElementById("title");
string s = el.InnerText; // get the : Title: **Developers**
UPDATE
Note that the //b above is XPath syntax which may be interesting for you to learn:
http://www.w3schools.com/XPath/xpath_syntax.asp
http://www.freeformatter.com/xpath-tester.html