How to select html element with namespace with Fizzler / HtmlAgilityPack? - c#

I am using Fizzler / HtmlAgilityPack to parse and extract elements from an ASP.NET page file. The page also uses Telerik controls, e.g.
<telerik:RadGrid ... >
To extract this element I tried the methods below, without success. Can someone help with this, please?
method#1:
HtmlDocument document = .....;
document.SelectNodes("telerik:RadGrid");
and it throws an exception.
Then I tried method#2:
XPathNavigator navigator = document.CreateNavigator();
var manager = new XmlNamespaceManager(navigator.NameTable);
manager.AddNamespace("telerik", "http://www.telerik.com");
var expr = XPathExpression.Compile("RadGrid");
expr.SetContext(manager);
var grids = document.DocumentNode.SelectNodes(expr);
This time there is no exception, but grids is null even though the ASP.NET page contains telerik:RadGrid markup.

Your XPath may be incorrect.
Try //*[name()='telerik:RadGrid'] instead. It compares the full element name, prefix included, so it should work for elements that carry an XML namespace prefix.
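If the XPath route still returns nothing, a LINQ filter over the parsed nodes avoids namespace handling altogether. The following is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; the sample markup and the FindByTagName helper are made up for illustration. HtmlAgilityPack reports HtmlNode.Name in lower case, which is why the comparison ignores case (the same lower-casing may also affect a case-sensitive name() test in XPath).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class PrefixedTagDemo
{
    // Hypothetical helper: returns every element whose tag name, prefix included,
    // equals tagName when case is ignored.
    public static List<HtmlNode> FindByTagName(HtmlDocument document, string tagName)
    {
        return document.DocumentNode
            .Descendants()
            .Where(n => string.Equals(n.Name, tagName, StringComparison.OrdinalIgnoreCase))
            .ToList();
    }

    public static void Main()
    {
        // Illustrative markup only; a real .aspx file would be loaded with document.Load(path).
        var document = new HtmlDocument();
        document.LoadHtml("<div><telerik:RadGrid ID=\"Grid1\"></telerik:RadGrid></div>");

        foreach (var grid in FindByTagName(document, "telerik:RadGrid"))
        {
            Console.WriteLine(grid.GetAttributeValue("ID", "(no id)"));
        }
    }
}
```

Because the filter is plain C#, no XmlNamespaceManager is needed, at the cost of walking every node in the document.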

Related

How to extract a specific line from a webpage in c#

HttpWebRequest myReq = (HttpWebRequest)WebRequest.Create("https://www.google.com/search?q=" + "msg");
HttpWebResponse myres = (HttpWebResponse)myReq.GetResponse();
using (StreamReader sr = new StreamReader(myres.GetResponseStream()))
{
pageContent = sr.ReadToEnd();
}
if (pageContent.Contains("find"))
{
display = "done";
}
Currently this code checks whether "find" exists on a URL and displays "done" if it is present.
What I want is to display the whole line or paragraph that contains "find".
So instead of display = "done", I want to store the line that contains "find" in display.
HTML pages don't have lines. Whitespace outside tags is ignored, and an entire minified page may have no newlines at all. Even if it did, newlines in the text are not rendered as line breaks, which is why <br> is necessary. If you want to find a specific element, you'll have to use an HTML parser like HtmlAgilityPack and identify the element using an XPath or CSS selector expression.
Copying from the landing page examples:
var url = $"https://www.google.com/search?q={msg}" ;
var web = new HtmlWeb();
var doc = web.Load(url);
var value = doc.DocumentNode
.SelectNodes("//div[@id='center_col']")
.First()
.Attributes["value"].Value;
What you put in SelectNodes depends on what you want to find.
One way to test various expressions is to open the web page you want in a browser, open the browser's Developer Tools and start searching in the Element inspector. The search functionality there accepts XPath and CSS selectors.

Can't use HtmlDocument in C#

I created a console application that uses Selenium to get the text from a table.
I tried this code:
IList<IWebElement> tableRows = browser.FindElementsByXPath("id('column2')/tbody/tr");
var doc = new HtmlDocument();
doc.LoadHtml(tableRows);
The error is:
'HtmlDocument' does not contain a constructor that takes 0 arguments
I read an answer to another question, and almost everyone on Stack Overflow seems to be able to write:
new HtmlDocument.
Why can't I use this? I also tried a WinForms application, but I can't use HtmlDocument there either.
Also, HtmlDocument seems to offer only LoadHtml(String), but what I have is an IList<IWebElement>.
I don't know how to convert it to an HTML string to load into doc.
IWebElement table = browser.FindElement(By.Id("column2"));
// Fully qualified in case another HtmlDocument type (e.g. System.Windows.Forms.HtmlDocument) is in scope
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(table.GetAttribute("innerHTML"));
First of all, you can get the table elements using Selenium alone. If you choose to use HtmlAgilityPack, LoadHtml needs a string containing the HTML source. So find the HTML block (in your case the table) as an IWebElement and pass its inner HTML, read with table.GetAttribute("innerHTML"), to LoadHtml.
You can also load the full page source: doc.LoadHtml(browser.PageSource);

How to get an element using c#

I'm new with C#, and I'm trying to access an element from a website using webBrowser. I wondered how can I get the "Developers" string from the site:
<div id="title" style="display: block;">
<b>Title:</b> **Developers**
</div>
I tried to use webBrowser1.Document.GetElementById("title") ,but I have no idea how to keep going from here.
Thanks :)
You can download the page source using the WebClient class,
then look within it for <b>Title:</b>**Developers**</div> and strip everything except "Developers".
HtmlAgilityPack and CsQuery are what many people use to work with HTML pages in .NET, and I'd recommend them too.
But if your task is limited to this simple requirement, and the <div> markup is valid XHTML (like the sample you posted), you can treat it as XML. That means you can use a native .NET API such as XDocument or XmlDocument to parse the markup and run an XPath query to get a specific part of it, for example:
var xml = @"<div id=""title"" style=""display: block;""> <b>Title:</b> Developers</div>";
//or according to your code snippet, you may be able to do as follow :
//var xml = webBrowser1.Document.GetElementById("title").OuterHtml;
var doc = new XmlDocument();
doc.LoadXml(xml);
var text = doc.DocumentElement.SelectSingleNode("//div/b/following-sibling::text()");
Console.WriteLine(text.InnerText);
//above prints " Developers"
The XPath above selects the text node (" Developers") that follows the <b> node.
You can use HtmlAgilityPack (As mentioned by Giannis http://htmlagilitypack.codeplex.com/). Using a web browser control is too much for this task:
HtmlAgilityPack.HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("http://www.google.com");
var el = doc.GetElementbyId("title");
string s = el.InnerHtml; // get the : <b>Title:</b> **Developers**
I haven't tried this code but it should be very close to working.
HtmlAgilityPack nodes also have an InnerText property, allowing you to do this:
string s = el.InnerText; // get the : Title: **Developers**
You can also remove the Title: by removing the appropriate node:
el.SelectSingleNode(".//b").Remove(); // the leading "." limits the search to el; a bare "//b" searches the whole document
string s = el.InnerText; // get the : **Developers**
If for some reason you want to stick to the web browser control, I think you can do this:
var el = webBrowser1.Document.GetElementById("title");
string s = el.InnerText; // get the : Title: **Developers**
UPDATE
Note that the expression passed to SelectSingleNode above is XPath syntax, which may be worth learning:
http://www.w3schools.com/XPath/xpath_syntax.asp
http://www.freeformatter.com/xpath-tester.html

Pull timer value from a webpage using xPath and C#

I am trying to pull some timer values off of websites using the xpath in the HtmlAgilityPack. However, when I am using the xpath, I get null reference exceptions because a particular node does not exist when I am grabbing it. To test why this was, I used a doc.Save to check the nodes myself, and I found that the nodes truly do not exist. From my understanding, HtmlAgilityPack should download the webpage almost exactly how I see it, with all the data in there as well. However, most of the data in fact is missing.
How exactly am I supposed to grab the timer values, or even an event title from either of the following websites:
http://dulfy.net/2014/04/23/event-timer/
http://guildwarstemple.com/dragontimer/eventsb.php?serverKey=108&langKey=1
My current code to pull just the title of the event from the first timebox from guildwarstemple is:
public void updateEventData()
{
//string Url = "http://dulfy.net/2014/04/23/event-timer/";
string Url = "http://guildwarstemple.com/dragontimer/eventsb.php?serverKey=108&langKey=1";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(Url);
doc.Save("c:/doc.html");
Title = doc.DocumentNode.SelectNodes("//*[@id='ep1']/p")[0].InnerText;
//*[@id="scheduleList"]/div[3]
//*[@id="scheduleList"]/div[3]/div[3]/text()
}
Your XPath expression fails because there is only one div with @id='ep1' in the document, and it has no p inside:
<div id="ep1" class="eventTimeBox"></div>
In fact, all the divs in megaContainer are empty in the link you are trying to load with your code.
If you think there should be p elements in there, they are probably added dynamically via JavaScript, so they will not be available when you scrape the site with a C# client that does not run scripts.
In fact, there are some JavaScript variables:
<script>
...
var e7 = 'ep1';
...
var e7t = '57600';
...
Maybe you want to get that data. This:
substring-before(substring-after(normalize-space(//script[contains(.,"var e7t")]),"var e7t = '"),"'")
selects the <script> which contains var e7t and extracts the string in the apostrophes. It will return:
57600
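To see how that expression behaves, here is a small sketch that evaluates it with the XPath engine built into .NET. It uses XmlDocument on a made-up, well-formed stand-in for the page (the sample string and the ExtractTimerValue helper are assumptions for illustration), not the live site. With HtmlAgilityPack, calling Evaluate on the navigator returned by doc.DocumentNode.CreateNavigator() should work along the same lines, though that is untested here.

```csharp
using System;
using System.Xml;

public static class ScriptVariableDemo
{
    // The expression from the answer as a C# verbatim string ("" is an escaped double quote).
    private const string Expression =
        @"substring-before(substring-after(normalize-space(//script[contains(.,""var e7t"")]),""var e7t = '""),""'"")";

    // Hypothetical helper: loads well-formed markup and evaluates the expression against it.
    public static string ExtractTimerValue(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        // Evaluate returns a string because the outermost XPath function is a string function.
        return (string)doc.CreateNavigator().Evaluate(Expression);
    }

    public static void Main()
    {
        // Illustrative stand-in for the real page's <script> block.
        const string sample = "<html><script>var e7 = 'ep1'; var e7t = '57600';</script></html>";
        Console.WriteLine(ExtractTimerValue(sample)); // prints 57600
    }
}
```

SelectNodes only returns node sets, so string functions like substring-before need Evaluate (or an equivalent) rather than SelectNodes.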
The same with your other link. The expression:
//*[@id="scheduleList"]
selects an empty div. You can't navigate further inside it:
<div id="scheduleList" style="width: 720px; min-width: 720px; background: #1a1717; color: #656565;"></div>
But this time there seems to be no nested JavaScript that refers to it in the page.

Explicit Element Closing Tags with System.Xml.Linq Namespace

I am using the (.NET 3.5 SP1) System.Xml.Linq namespace to populate an html template document with div tags of data (and then save it to disk). Sometimes the div tags are empty and this seems to be a problem when it comes to HTML. According to my research, the DIV tag is not self-closing. Therefore, under Firefox at least, a <div /> is considered an opening div tag without a matching closing tag.
So, when I create new div elements by declaring:
XElement divTag = new XElement("div");
How can I force the generated XML to be <div></div> instead of <div /> ?
I'm not sure why you'd end up with an empty DIV (seems a bit pointless!) But:
divTag.SetValue(string.Empty);
Should do it.
With
XElement divTag = new XElement("div", String.Empty);
you get the explicit closing tag
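The difference between the two constructors is easy to check with a short console program. This is only a sketch; the Render helper is a made-up name for illustration.

```csharp
using System;
using System.Xml.Linq;

public static class EmptyElementDemo
{
    // Hypothetical helper: builds a div with or without empty-string content and serializes it.
    public static string Render(bool explicitClose)
    {
        var div = explicitClose
            ? new XElement("div", string.Empty) // empty text content, so the element is not "empty"
            : new XElement("div");              // no content at all
        return div.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Render(false)); // prints <div />
        Console.WriteLine(Render(true));  // prints <div></div>
    }
}
```

An element with empty-string content reports IsEmpty as false, which is what makes the serializer write the separate closing tag.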
I don't know the answer to your question using LINQ. But there is a project called HTML Agility Pack on CodePlex that lets you create and manipulate HTML documents in much the same way you manipulate XML documents with the System.Xml namespace classes.
I did this. Working as expected.
myXml = new XElement("script", new XAttribute("src", "value"));
myXml.Value = "";
This gives the result below:
<script src="value"></script>
